RSS feed reader :)

8 replies to this topic

#1
reilly

reilly

    Newbie

  • Members
  • PipPip
  • 11 posts
Ok, this is my attempt at a program that reads RSS feeds. So far it can download the HTML of the feed into a buffer and check for tags as it goes. This will be used to find a new item, e.g. <origLink>, but my issue is that it doesn't get the entire page HTML... I tried it on a few different sites and it has the same issue; I think it has worked on maybe one. :cursing::cursing:

#include <winsock2.h>
#include <windows.h>
#include <iostream>
#include <string>
#pragma comment(lib,"ws2_32.lib")

using namespace std;

int main (){
    WSADATA wsaData;

    if (WSAStartup(MAKEWORD(2,2), &wsaData) != 0) {
        cout << "WSAStartup failed." << endl;
        cin.get();
        return 1;
    }

    string requestSend;
    string sendHost = "feeds.feedburner.com";
    string sendDirectory = "/failblog?format=xml";

    requestSend += "GET ";
    requestSend += sendDirectory;
    requestSend +=" HTTP/1.1\r\n";
    requestSend += "Host: ";
    requestSend +=    sendHost;
    requestSend += "\r\n";
    requestSend += "Connection: close\r\n";
    requestSend += "\r\n";

    SOCKET Socket=socket(AF_INET,SOCK_STREAM,IPPROTO_TCP);

    struct hostent *host;
    host = gethostbyname(sendHost.c_str());

    SOCKADDR_IN SockAddr;
    SockAddr.sin_port=htons(80);
    SockAddr.sin_family=AF_INET;
    SockAddr.sin_addr.s_addr = *((unsigned long*)host->h_addr);

    cout << "Connecting..." << endl;;
    if(connect(Socket,(SOCKADDR*)(&SockAddr),sizeof(SockAddr)) == SOCKET_ERROR ){ // or you could have !=0
        cout << "Could not connect";
        cin.get();
        return 1;
    }
    cout << "Connected." << endl;

    send(Socket,requestSend.c_str(), requestSend.length(),0);
    char buffer[10000];
    int dataLength = recv(Socket,(char*)&buffer,sizeof(buffer),0);
    int i = 0;
    char check1[] = {'c', 'h', 'e', 'e', 'z', 'b', 'u', 'r'}; // the string it searches the end of the buffer for
    //char check1[] = {'o', 'r', 'i', 'g', 'L', 'i', 'n', 'k'};

    int checked;
    while(i < dataLength)
    {
                checked = 1; // reset to 1 each time
            cout << buffer[i];
            
            for(int x = 0; x<=7; x++){
                if(buffer[i-7+x] != check1[x]){
                    checked = 0;
                    break; //efficiency
                }
            }
            if(checked){cout << "<---MATCH!";} //if the last 8 values of buffer are the same as the first 8 of check1
            i ++;
    }

    cout << endl;
    cout << "Size: " << i << endl;
    cout << "datalength: " << dataLength << endl;
    cout << "Size of buffer: " << sizeof(buffer) << endl;
    cout << "Request length: " << requestSend.length() << endl;

    closesocket(Socket);
    WSACleanup();

    cin.get();
    return 0;
}
Thanks in advance if anyone is able to help :crying:

Edited by reilly, 17 September 2010 - 08:37 PM.
added a few comments


#2
dbug

dbug

    Programmer

  • Members
  • PipPipPipPip
  • 155 posts
The entire HTML page is larger than 10000 bytes, so you only get the first fragment of the page. You should add a loop that keeps reading data until dataLength is 0 (end of data) or SOCKET_ERROR. There is a special case you should take care of: the string you are searching for may cross the 10000-byte block boundary.
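The read loop described above can be sketched as follows. This is a minimal sketch, not code from the thread: `readAll` and `ReceiveFn` are hypothetical names, and the socket call is passed in as a callable so the loop can be tried without a network connection.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical stand-in for a call such as recv(Socket, buf, len, 0).
// It must return the number of bytes written into buf, 0 when the peer
// has closed the connection, or a negative value on error
// (Winsock's SOCKET_ERROR is -1).
using ReceiveFn = std::function<int(char* buf, int len)>;

// Keeps calling `receive` until it reports end of data or an error,
// appending every chunk to `out`. Returns true on a clean end of data.
bool readAll(const ReceiveFn& receive, std::string& out) {
    char buffer[4096];
    for (;;) {
        int n = receive(buffer, static_cast<int>(sizeof(buffer)));
        if (n == 0) return true;   // connection closed: nothing more to read
        if (n < 0) return false;   // error reported by the receive call
        // Only the first n bytes of buffer are valid for this call.
        out.append(buffer, static_cast<std::size_t>(n));
    }
}
```

With Winsock, the callable could be a lambda such as `[&](char* b, int n) { return recv(Socket, b, n, 0); }`. Collecting everything into one string also sidesteps the block-boundary problem, at the cost of holding the whole page in memory.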

There is a bug in your comparison code: if (buffer[i-7+x] != check1[x]) is accessing a bad memory area for the first 7 bytes of each block. It should be: if (buffer[i+x] != check1[x]) and the while loop should end at dataLength - 7 (it is a bit more complex if you need to analyze multiple blocks).

As a side note, I would use sizeof(check1) instead of a fixed number. It's easier to make changes later.

#3
reilly

reilly

    Newbie

  • Members
  • PipPip
  • 11 posts
The 10000-byte buffer was not going to be permanent; all the things like not using sizeof are just until I get it working.
By the way, it doesn't stop reading at 10000 bytes but at around 4000.
Also, my while(i < dataLength) loop runs until the end of dataLength is reached (because dataLength is the length of the data in buffer).
And that sizeof(check1) would need to be sizeof(check1)-1 because of the null char at the end.

Also, are you sure there is an error with the check? It works fine. What it does is take the current end position in buffer and go back 7, check that against the first char in check1, then go back 6 and check that against the second char in check1, etc.

But even with all that I still can't get it to download the entire HTML. :crying::confused:

#4
dbug

dbug

    Programmer

  • Members
  • PipPipPipPip
  • 155 posts
sizeof(check1) does not count the null terminator because you haven't defined it (the way you have defined check1 doesn't include the null character).
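The difference can be checked directly. In this small illustration the array names are made up for the example; only the initialiser style matters.

```cpp
#include <cstddef>

// Initialised element by element: exactly 8 chars, no terminator is added.
const char checkList[] = {'c', 'h', 'e', 'e', 'z', 'b', 'u', 'r'};

// Initialised from a string literal: the compiler appends a '\0'.
const char checkLiteral[] = "cheezbur";

constexpr std::size_t listSize = sizeof(checkList);                // 8
constexpr std::size_t literalSize = sizeof(checkLiteral);          // 9
constexpr std::size_t literalLength = sizeof(checkLiteral) - 1;    // 8

static_assert(listSize == 8, "brace-initialised array has no terminator");
static_assert(literalSize == 9, "string literal includes the terminator");
```

So the `- 1` adjustment is only needed when the array is initialised from a string literal.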

I've tried your code and I get 10000 bytes on the first read, not 4000. In any case, recv is not required to fill the buffer: it may return less data than requested, and that does not mean there is no more data. You must keep reading from the socket until 0 or SOCKET_ERROR is returned. A short read can happen when recv is called before all of the server's data has arrived (for example because of a slow connection or server).

Regarding dataLength, if you initialize i with 0, then the first check on the buffer will be at index i-7+x = -7. This is outside the buffer. If you change the comparison as I mentioned, then the first check will be at index 0, however, the last check will be at index 9999+7 = 10006, that is also outside the buffer. That's why you need to stop at dataLength - 7.
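The bounds argument above can be sketched for a single block like this. `countInBlock` is a hypothetical helper, not code from the thread, and it deliberately ignores matches that straddle two blocks.

```cpp
#include <cstddef>
#include <cstring>

// Counts occurrences of pat[0..patLen) inside buf[0..len).
// Every index touched is buf[i + x] with i + x <= len - 1, so the
// comparison never reads before the start or past the end of the block.
std::size_t countInBlock(const char* buf, std::size_t len,
                         const char* pat, std::size_t patLen) {
    if (patLen == 0 || len < patLen) return 0;
    std::size_t count = 0;
    // The last valid start position is len - patLen.
    for (std::size_t i = 0; i + patLen <= len; ++i) {
        if (std::memcmp(buf + i, pat, patLen) == 0) ++count;
    }
    return count;
}
```

For an 8-character pattern the last start position is `len - 8`, which is the same as stopping the loop at `dataLength - 7` with a `<` comparison.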

#5
dbug

dbug

    Programmer

  • Members
  • PipPipPipPip
  • 155 posts
If you want to confirm that the short first read (fewer than 10000 bytes) is caused by what I described earlier, you can add a Sleep(1000) call before calling recv. This waits one second before trying to read anything.

#6
reilly

reilly

    Newbie

  • Members
  • PipPip
  • 11 posts
Yes, you are correct, dbug!

I still can't seem to get this part to work though.

while((recv(Socket,buffer,sizeof(buffer),0)) != 0)
    {
        i = 0;
        while(buffer[i] > 0)
        {
            cout << buffer[i]; 
            i++;
        }
    }

I'm guessing that i needs to be reset to 0 before each read, since the new chunk replaces what was in buffer?

#7
dbug

dbug

    Programmer

  • Members
  • PipPipPipPip
  • 155 posts
Checking if buffer[i] > 0 is not a valid condition to determine the end of the buffer.

I think this should work in this case:

    char buffer[10000];
    char check1[] = {'c', 'h', 'e', 'e', 'z', 'b', 'u', 'r'};
    int dataLength;
    int p = 0;
    int pos = 0;
    int count = 0;
    while ((dataLength = recv(Socket, buffer, sizeof(buffer), 0)) > 0)
    {
        int i = 0;
        do
        {
            pos++;
            if (buffer[i++] == check1[p])
            {
                if (++p >= sizeof(check1))    // assuming no null termination in check1
                {
                    cout << "Match found. Current offset: " << pos << endl;
                    count++;
                    p = 0;
                }
            }
            else
            {
                p = 0;
            }
        } while (--dataLength);
    }

    cout << count << " matches found" << endl;
This code cannot detect words that appear just after a fragment of the same word. For example, if you search 'cheezbur' and the text contains 'chcheezbur', it won't detect it. I don't think this is necessary for this special case, however it can be done by adding some complexity.

#8
reilly

reilly

    Newbie

  • Members
  • PipPip
  • 11 posts
In your code, doesn't it ignore the very first byte?

while ((dataLength = recv(Socket, buffer, sizeof(buffer), 0)) > 0)
    {
        int i = 0;
        do
        {
            if(buffer[i] == check1[p]){//current buffer is equal to first in check1
                p++; //ready to check the second check1
                if(p == sizeof(check1)){
                    cout << "<-- MATCH!!";
                }
            }
            else{
                p=0;//not equal so return to 0
                if(buffer[i] == check1[p]){//could be the start of a new one e.g. chcheez
                    p++;
                }
            }
            pos++;i++;
        
            cout << buffer[i];
        } while (dataLength--);
    }
This also deals with the chcheezbur issue, I think.

OK, this grabs the link of each new post. Thank you so much, dbug, for clearing some issues up :D
I should be able to finish this myself for now.

    char buffer[10000];
    char check1[] = {'<','f','e','e','d','b','u','r','n','e','r',':','o','r','i','g','L','i','n','k','>'};
    int dataLength;
    int p = 0;
    int pos = 0;
    int count = 0;
    while ((dataLength = recv(Socket, buffer, sizeof(buffer), 0)) > 0)
    {
        int i = 0;
        do
        {
            if(buffer[i] == check1[p]){//current buffer is equal to first in check1
                p++; //ready to check the second check1
                if(p == sizeof(check1)){//a full match is found
                    p=0;
                    i++;//so takes the data AFTER the found string
                    while(buffer[i] != '<'){//could error if there is a '<' tag in the name....
                        cout << buffer[i];
                        i++;
                    }
                    cout << endl;
                }
            }
            else{
                p=0;//not equal so return to 0
                if(buffer[i] == check1[p]){//could be the start of a new one e.g. chcheez
                    p++;
                }
            }
            pos++;i++;
        } while (dataLength--);
    }

Edited by reilly, 21 September 2010 - 10:24 PM.


#9
dbug

dbug

    Programmer

  • Members
  • PipPipPipPip
  • 155 posts
No, my code handles the first byte of each block correctly (i++ increments i after the buffer has been indexed). The exit condition of the do {} while loop must be --dataLength. Writing '--' before the variable name is not the same as writing it after. If it comes before, the variable is decremented and then tested; if it comes after, the variable is tested and then decremented. Using '--' after the variable name causes one extra loop iteration, which reads one byte past the received data.
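The difference in iteration count is easy to see in isolation. The two functions below are illustrative only (their names are not from the thread) and assume n is greater than 0, as dataLength is inside the recv loop.

```cpp
// Runs the body once per byte: decrement first, then test the new value.
int iterationsPreDecrement(int n) {
    int runs = 0;
    do {
        ++runs;
    } while (--n);
    return runs;
}

// Tests the old value first, then decrements, so the body runs once more.
int iterationsPostDecrement(int n) {
    int runs = 0;
    do {
        ++runs;
    } while (n--);
    return runs;
}
```

For n = 3 the first form runs the body 3 times and the second runs it 4 times.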

Your code can handle things like 'chcheez', but other words are harder. For example, 'coconut' will not be detected if 'cococonut' is present in the page. However, as I said, in this specific case all this logic is not necessary because these things won't happen.
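For the general case, one well-known option is the Knuth-Morris-Pratt failure table, which is a different technique from the code posted in this thread. The sketch below uses a hypothetical `StreamMatcher` class; it keeps its state between calls, so a match may also span two recv blocks.

```cpp
#include <cstddef>
#include <string>
#include <vector>

class StreamMatcher {
public:
    explicit StreamMatcher(const std::string& pattern)
        : pat_(pattern), fail_(pattern.size(), 0) {
        // fail_[i] = length of the longest proper prefix of pat_[0..i]
        // that is also a suffix of it.
        for (std::size_t i = 1, k = 0; i < pat_.size(); ++i) {
            while (k > 0 && pat_[i] != pat_[k]) k = fail_[k - 1];
            if (pat_[i] == pat_[k]) ++k;
            fail_[i] = k;
        }
    }

    // Feeds one chunk of data; returns how many matches ended in this chunk.
    std::size_t feed(const char* data, std::size_t len) {
        if (pat_.empty()) return 0;
        std::size_t found = 0;
        for (std::size_t i = 0; i < len; ++i) {
            // On a mismatch, fall back to the longest prefix that still fits
            // instead of restarting from zero.
            while (state_ > 0 && data[i] != pat_[state_]) state_ = fail_[state_ - 1];
            if (data[i] == pat_[state_]) ++state_;
            if (state_ == pat_.size()) {
                ++found;
                state_ = fail_[state_ - 1];  // allow overlapping matches
            }
        }
        return found;
    }

private:
    std::string pat_;
    std::vector<std::size_t> fail_;
    std::size_t state_ = 0;  // number of pattern chars currently matched
};
```

Inside the recv loop this would be called as `matcher.feed(buffer, dataLength)` for each block, with one matcher object living across all reads.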



