[net2-wg] ctp forwader bug
Philip Levis
pal at cs.stanford.edu
Mon Aug 17 06:49:04 PDT 2009
On Aug 17, 2009, at 1:46 AM, Omprakash Gnawali wrote:
> On Sat, Aug 15, 2009 at 11:12 AM, Omprakash Gnawali<gnawali at usc.edu>
> wrote:
>> On Sat, Aug 15, 2009 at 6:47 AM, Philip Levis<pal at cs.stanford.edu>
>> wrote:
>>> On Aug 14, 2009, at 12:17 AM, Omprakash Gnawali wrote:
>>>
>>>> - ctp bugs/cleanup
>>>
>>> (written on a plane to London...)
>>>
>>> I'm sending some details because I won't be able to test code for
>>> the next
>>> week (I have no motes). So I'm putting down what I've figured out.
>>> Maybe
>>> this will be enough for someone else to figure it out, or I can
>>> work on it
>>> in two weeks.
>>>
>>> Looking through the logs Om sent, it seems the problem is
>>> acknowledgments.
>>> Specifically, acks aren't working so nodes are retransmitting
>>> many, many
>>> times. For example, here's part of a log from node 3:
>>>
>>> 1250272888.261593 (241): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>> 1250272888.277591 (242): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>> 1250272888.309590 (243): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>> 1250272888.341088 (244): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>> 1250272888.373085 (245): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>> 1250272888.389085 (246): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>> 1250272888.436582 (247): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>> 1250272888.453081 (248): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>> 1250272888.564706 (249): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>> 1250272888.644066 (250): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>> 1250272888.644138 (251): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>> 1250272888.671064 (252): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>> 1250272888.671125 (253): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>> 1250272888.671204 (254): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>> 1250272888.671263 (255): NET_C_TREE_NEW_PARENT 31 30 43 86
>>>
>>> (brief referesher: WAITACK 25 11 3 0 means this is a unacked send
>>> done,
>>> which is code 25, for packet #11 from node 3, with a next hop of
>>> zero, the
>>> root)
>>>
>>> Meanwhile, the root has:
>>>
>>> 1250272888.249094 (1031): NET_C_FE_RCV_MSG 21 11 3 1
>>> 1250272888.281088 (1032): NET_C_FE_RCV_MSG 21 11 3 1
>>> 1250272888.313086 (1033): NET_C_FE_RCV_MSG 21 11 3 1
>>> 1250272888.329088 (1034): NET_C_FE_RCV_MSG 21 11 3 1
>>> 1250272888.361084 (1035): NET_C_FE_RCV_MSG 21 11 3 1
>>> 1250272888.393081 (1036): NET_C_FE_RCV_MSG 21 11 3 1
>>> 1250272888.424080 (1037): NET_C_FE_RCV_MSG 21 11 3 1
>>> 1250272888.495139 (1040): NET_C_FE_RCV_MSG 21 11 3 1
>>> 1250272888.548071 (1041): NET_C_FE_RCV_MSG 21 11 3 1
>>> 1250272888.548307 (1043): NET_C_FE_RCV_MSG 21 11 3 1
>>> 1250272888.563073 (1045): NET_C_FE_RCV_MSG 21 11 3 1
>>> 1250272888.592218 (1047): NET_C_FE_RCV_MSG 21 11 3 1
>>> 1250272888.634663 (1049): NET_C_FE_RCV_MSG 21 11 3 1
>>> 1250272892.631463 (1081): NET_C_FE_RCV_MSG 21 12 3 2
>>>
>>> So the root is receiving all of these transmissions but for some
>>> reason
>>> they're not being acked.
>>>
>>> Here is the relevant code from sendTask (lines 440-461):
>>>
>>> else {
>>> /* The basic forwarding/sending case. */
>>> call CtpPacket.setEtx(qe->msg, gradient);
>>> call CtpPacket.clearOption(qe->msg, CTP_OPT_ECN |
>>> CTP_OPT_PULL);
>>> if (call PacketAcknowledgements.requestAck(qe->msg) ==
>>> SUCCESS) {
>>> setState(ACK_PENDING);
>>> }
>>> else {
>>> clearState(ACK_PENDING);
>>> }
>>> if (hasState(QUEUE_CONGESTED)) {
>>> call CtpPacket.setOption(qe->msg, CTP_OPT_ECN);
>>> clearState(QUEUE_CONGESTED);
>>> }
>>>
>>> subsendResult = call SubSend.send(dest, qe->msg, payloadLen);
>>> if (subsendResult == SUCCESS) {
>>> // Successfully submitted to the data-link layer.
>>> setState(SENDING);
>>> dbg("CtpForwarder", "%s: subsend succeeded with %p.\n",
>>> __FUNCTION__, qe->msg);
>>> return;
>>> }
>>>
>>> Here's the relevant code from sendDone (lines 501-520):
>>>
>>> else if (hasState(ACK_PENDING) && !call
>>> PacketAcknowledgements.wasAcked(msg)) {
>>> /* No ack: if countdown is not 0, retransmit, else drop the
>>> packet. */
>>> call LinkEstimator.txNoAck(call AMPacket.destination(msg));
>>> call CtpInfo.recomputeRoutes();
>>> if (--qe->retries) {
>>> dbg("CtpForwarder", "%s: not acked, retransmit\n",
>>> __FUNCTION__);
>>> call CollectionDebug.logEventMsg(NET_C_FE_SENDDONE_WAITACK,
>>> call
>>> CollectionPacket.getSequenceNumber(msg),
>>> call
>>> CollectionPacket.getOrigin(msg),
>>> call
>>> AMPacket.destination(msg));
>>> startRetxmitTimer(SENDDONE_NOACK_WINDOW,
>>> SENDDONE_NOACK_OFFSET);
>>> } else {
>>> /* Hit max retransmit threshold: drop the packet. */
>>> call SendQueue.dequeue();
>>> clearState(SENDING);
>>> startRetxmitTimer(SENDDONE_OK_WINDOW, SENDDONE_OK_OFFSET);
>>>
>>> sendComplete(qe, msg, FALSE);
>>> }
>>> }
>>>
>>> What I would do to dig deeper is first determine if the ack
>>> failure is on
>>> the send or sendDone. That is, is the ack bit of the transmitted
>>> packet not
>>> set, such that the receiver does not generate an ack, or is that
>>> an ack is
>>> generated but for some reason wasAcked isn't returning the right
>>> thing.
>>>
>>> Or there could be some trivial bug in the code above that I can't
>>> find
>>> (e.g., an || instead of an &&, etc.).
>>>
>>> Hope this helps...
>>>
>>> Om, I would recommend reverting to 1.18 until we figure this out.
>>
>>
>> Walked through 18-22. Reverted to 21 because that seems to be
>> working;
>> the bug was introduced in 22.
>
>
> With 22, I noticed that the network runs into the fail ack send
> problem after a set amount of time. It happens simultaneously across
> the network:
> http://enl.usc.edu/~om_p/net2/ctpnew-1.22b/
> http://enl.usc.edu/~om_p/net2/ctpnew-1.22/
There seems to be a discrepancy between the logs and the plots. E.g.,
for ctpnew-1.22, I see a few WAIT_ACK events for node 5 at the
beginning of the plot, then 10 (t=600) minutes in the number spikes.
But the log only has 37 seconds of data....
aeolus /tmp/ctpnew-1.22 -11-> head -1 messages-5.txt
1250455707.087995 (237): NET_C_FE_SENDDONE_WAITACK 25 6 5 0
aeolus /tmp/ctpnew-1.22 -12-> tail -1 messages-5.txt
1250455744.260685 (474): NET_C_FE_SENDQUEUE_EMPTY 50 0 0 0
What am I missing here?
Phil
More information about the net2-wg
mailing list