[net2-wg] ctp forwader bug
Omprakash Gnawali
gnawali at usc.edu
Mon Aug 17 10:34:25 PDT 2009
On Mon, Aug 17, 2009 at 6:49 AM, Philip Levis<pal at cs.stanford.edu> wrote:
>
> On Aug 17, 2009, at 1:46 AM, Omprakash Gnawali wrote:
>
>> On Sat, Aug 15, 2009 at 11:12 AM, Omprakash Gnawali<gnawali at usc.edu>
>> wrote:
>>>
>>> On Sat, Aug 15, 2009 at 6:47 AM, Philip Levis<pal at cs.stanford.edu> wrote:
>>>>
>>>> On Aug 14, 2009, at 12:17 AM, Omprakash Gnawali wrote:
>>>>
>>>>> - ctp bugs/cleanup
>>>>
>>>> (written on a plane to London...)
>>>>
>>>> I'm sending some details because I won't be able to test code for the
>>>> next
>>>> week (I have no motes). So I'm putting down what I've figured out. Maybe
>>>> this will be enough for someone else to figure it out, or I can work on
>>>> it
>>>> in two weeks.
>>>>
>>>> Looking through the logs Om sent, it seems the problem is
>>>> acknowledgments.
>>>> Specifically, acks aren't working so nodes are retransmitting many, many
>>>> times. For example, here's part of a log from node 3:
>>>>
>>>> 1250272888.261593 (241): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>>> 1250272888.277591 (242): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>>> 1250272888.309590 (243): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>>> 1250272888.341088 (244): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>>> 1250272888.373085 (245): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>>> 1250272888.389085 (246): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>>> 1250272888.436582 (247): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>>> 1250272888.453081 (248): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>>> 1250272888.564706 (249): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>>> 1250272888.644066 (250): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>>> 1250272888.644138 (251): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>>> 1250272888.671064 (252): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>>> 1250272888.671125 (253): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>>> 1250272888.671204 (254): NET_C_FE_SENDDONE_WAITACK 25 11 3 0
>>>> 1250272888.671263 (255): NET_C_TREE_NEW_PARENT 31 30 43 86
>>>>
>>>> (brief referesher: WAITACK 25 11 3 0 means this is a unacked send done,
>>>> which is code 25, for packet #11 from node 3, with a next hop of zero,
>>>> the
>>>> root)
>>>>
>>>> Meanwhile, the root has:
>>>>
>>>> 1250272888.249094 (1031): NET_C_FE_RCV_MSG 21 11 3 1
>>>> 1250272888.281088 (1032): NET_C_FE_RCV_MSG 21 11 3 1
>>>> 1250272888.313086 (1033): NET_C_FE_RCV_MSG 21 11 3 1
>>>> 1250272888.329088 (1034): NET_C_FE_RCV_MSG 21 11 3 1
>>>> 1250272888.361084 (1035): NET_C_FE_RCV_MSG 21 11 3 1
>>>> 1250272888.393081 (1036): NET_C_FE_RCV_MSG 21 11 3 1
>>>> 1250272888.424080 (1037): NET_C_FE_RCV_MSG 21 11 3 1
>>>> 1250272888.495139 (1040): NET_C_FE_RCV_MSG 21 11 3 1
>>>> 1250272888.548071 (1041): NET_C_FE_RCV_MSG 21 11 3 1
>>>> 1250272888.548307 (1043): NET_C_FE_RCV_MSG 21 11 3 1
>>>> 1250272888.563073 (1045): NET_C_FE_RCV_MSG 21 11 3 1
>>>> 1250272888.592218 (1047): NET_C_FE_RCV_MSG 21 11 3 1
>>>> 1250272888.634663 (1049): NET_C_FE_RCV_MSG 21 11 3 1
>>>> 1250272892.631463 (1081): NET_C_FE_RCV_MSG 21 12 3 2
>>>>
>>>> So the root is receiving all of these transmissions but for some reason
>>>> they're not being acked.
>>>>
>>>> Here is the relevant code from sendTask (lines 440-461):
>>>>
>>>> else {
>>>> /* The basic forwarding/sending case. */
>>>> call CtpPacket.setEtx(qe->msg, gradient);
>>>> call CtpPacket.clearOption(qe->msg, CTP_OPT_ECN | CTP_OPT_PULL);
>>>> if (call PacketAcknowledgements.requestAck(qe->msg) == SUCCESS) {
>>>> setState(ACK_PENDING);
>>>> }
>>>> else {
>>>> clearState(ACK_PENDING);
>>>> }
>>>> if (hasState(QUEUE_CONGESTED)) {
>>>> call CtpPacket.setOption(qe->msg, CTP_OPT_ECN);
>>>> clearState(QUEUE_CONGESTED);
>>>> }
>>>>
>>>> subsendResult = call SubSend.send(dest, qe->msg, payloadLen);
>>>> if (subsendResult == SUCCESS) {
>>>> // Successfully submitted to the data-link layer.
>>>> setState(SENDING);
>>>> dbg("CtpForwarder", "%s: subsend succeeded with %p.\n",
>>>> __FUNCTION__, qe->msg);
>>>> return;
>>>> }
>>>>
>>>> Here's the relevant code from sendDone (lines 501-520):
>>>>
>>>> else if (hasState(ACK_PENDING) && !call
>>>> PacketAcknowledgements.wasAcked(msg)) {
>>>> /* No ack: if countdown is not 0, retransmit, else drop the packet.
>>>> */
>>>> call LinkEstimator.txNoAck(call AMPacket.destination(msg));
>>>> call CtpInfo.recomputeRoutes();
>>>> if (--qe->retries) {
>>>> dbg("CtpForwarder", "%s: not acked, retransmit\n", __FUNCTION__);
>>>> call CollectionDebug.logEventMsg(NET_C_FE_SENDDONE_WAITACK,
>>>> call
>>>> CollectionPacket.getSequenceNumber(msg),
>>>> call
>>>> CollectionPacket.getOrigin(msg),
>>>> call AMPacket.destination(msg));
>>>> startRetxmitTimer(SENDDONE_NOACK_WINDOW, SENDDONE_NOACK_OFFSET);
>>>> } else {
>>>> /* Hit max retransmit threshold: drop the packet. */
>>>> call SendQueue.dequeue();
>>>> clearState(SENDING);
>>>> startRetxmitTimer(SENDDONE_OK_WINDOW, SENDDONE_OK_OFFSET);
>>>>
>>>> sendComplete(qe, msg, FALSE);
>>>> }
>>>> }
>>>>
>>>> What I would do to dig deeper is first determine if the ack failure is
>>>> on
>>>> the send or sendDone. That is, is the ack bit of the transmitted packet
>>>> not
>>>> set, such that the receiver does not generate an ack, or is that an ack
>>>> is
>>>> generated but for some reason wasAcked isn't returning the right thing.
>>>>
>>>> Or there could be some trivial bug in the code above that I can't find
>>>> (e.g., an || instead of an &&, etc.).
>>>>
>>>> Hope this helps...
>>>>
>>>> Om, I would recommend reverting to 1.18 until we figure this out.
>>>
>>>
>>> Walked through 18-22. Reverted to 21 because that seems to be working;
>>> the bug was introduced in 22.
>>
>>
>> With 22, I noticed that the network runs into the fail ack send
>> problem after a set amount of time. It happens simultaneously across
>> the network:
>> http://enl.usc.edu/~om_p/net2/ctpnew-1.22b/
>> http://enl.usc.edu/~om_p/net2/ctpnew-1.22/
>
> There seems to be a discrepancy between the logs and the plots. E.g., for
> ctpnew-1.22, I see a few WAIT_ACK events for node 5 at the beginning of the
> plot, then 10 (t=600) minutes in the number spikes. But the log only has 37
> seconds of data....
>
> aeolus /tmp/ctpnew-1.22 -11-> head -1 messages-5.txt
> 1250455707.087995 (237): NET_C_FE_SENDDONE_WAITACK 25 6 5 0
> aeolus /tmp/ctpnew-1.22 -12-> tail -1 messages-5.txt
> 1250455744.260685 (474): NET_C_FE_SENDQUEUE_EMPTY 50 0 0 0
>
> What am I missing here?
The tarball was not up-to-date. Here is the URL:
http://enl.usc.edu/~om_p/net2/ctpnew-1.22.tar.gz
http://enl.usc.edu/~om_p/net2/ctpnew-1.22b.tar.gz
There are a few new debug messages (0xae, etc), but here is the file:
http://enl.usc.edu/~om_p/net2/CtpForwardingEngineP.nc.22
- om_p
More information about the net2-wg
mailing list