Parsing CSV

csv

9 replies to this topic

#1 spadez

spadez

    CC Regular

  • Just Joined
  • PipPipPip
  • 38 posts

Posted 03 February 2009 - 02:42 PM

Hi,

I've been working on my code to read a CSV file in a C program (plain C, without any C# or C++ code). Here is the code so far:

#include <stdio.h>

int main(int argc, char ** argv){
  int c;
  FILE * fp;

  if(argc < 2){
    printf("Usage:\n\t%s filename\n",argv[0]);
    return -1;
  }

  if((fp = fopen(argv[1],"rb")) == NULL){
    printf("can't open %s\n",argv[1]);
    return -2;
  }

  while((c = fgetc(fp)) != EOF){
    /* TO DO- Put your code to parse the csv here char by char
     */
    printf("%c",c);
  }

  fclose(fp);
  return 0;
}

The next step is to parse the CSV file contents. Has anyone had any experience in doing this?

James
  • 0

#2 Lance

Lance

    CC Addict

  • Advanced Member
  • PipPipPipPipPip
  • 270 posts

Posted 03 February 2009 - 03:05 PM

Your thoughts?

If speed is not an issue, you can indeed do it that way:

  while((c = fgetc(fp)) != EOF){
      switch (c ){
       case '"':
       case '\'': /* can ' start a quoted string? */
           parse_a_quoted_string(c);
           break;
       case ',':
            field_ended();
            break;
       case '\n':
            /* also check whether the previous character was a '\r';
             * if so, it should be removed */
            record_ended();
            break;
        default:
             add_a_char_to_current_field(c);

      }
   }


But before you move on to define those subfunctions, what do you think the data structure should be? Dynamically allocating memory will be a must.
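
One way to flesh out the switch above is a small state machine that remembers whether it is currently inside a quoted field. The sketch below is only an illustration under a few assumptions: csv_count_fields is a hypothetical name, only double quotes are treated as quote characters, and the function counts fields in one record rather than storing them.

```c
/* Hypothetical helper: count the fields in one CSV record held in `s`.
 * Handles double-quoted fields and "" as an escaped quote.
 * Returns -1 if a quoted field is never closed. */
int csv_count_fields(const char *s)
{
    enum { UNQUOTED, QUOTED } state = UNQUOTED;
    int fields = 1;   /* even an empty record has one (empty) field */

    /* stop at the end of the string or at a newline outside quotes */
    for (; *s != '\0' && !(state == UNQUOTED && *s == '\n'); ++s) {
        if (state == QUOTED) {
            if (*s == '"') {
                if (s[1] == '"')
                    ++s;                /* "" inside quotes is a literal quote */
                else
                    state = UNQUOTED;   /* closing quote */
            }
        } else if (*s == '"') {
            state = QUOTED;             /* opening quote */
        } else if (*s == ',') {
            ++fields;                   /* a comma outside quotes ends a field */
        }
    }
    return state == QUOTED ? -1 : fields;
}
```

The same two states would drive field_ended() and add_a_char_to_current_field() in a version that stores the data instead of counting it.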
  • 0

#3 Lance

Lance

    CC Addict

  • Advanced Member
  • PipPipPipPipPip
  • 270 posts

Posted 03 February 2009 - 03:10 PM

A simple data structure would be:

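struct _CSV_Record
{
    unsigned cnt_of_fields;   /* number of fields in this record */
    char ** field_values;     /* array of cnt_of_fields strings */
};

typedef struct _CSV_Record CSV_Record;  /* in C the struct keyword is required here */

struct _CSV_Table
{
    unsigned record_cnt;      /* number of records in the table */
    CSV_Record *records;      /* array of record_cnt records */
};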

Depending on how you plan to access the data, you can also use a linked list. Access time would then be linear instead of constant (as it is with the arrays above), but adding a new record or a new field would be greatly simplified.
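
To make the array option concrete, here is a minimal sketch of appending a field to a record whose array grows on demand. It is not part of the structure posted above: the Record type adds a capacity member, and record_add_field and record_free are hypothetical names chosen for this example.

```c
#include <stdlib.h>
#include <string.h>

/* Variant of the record structure with an extra `capacity` member
 * (an assumption made for this sketch). */
typedef struct {
    unsigned cnt_of_fields;
    unsigned capacity;
    char **field_values;
} Record;

/* Append a copy of `value`. Returns 0 on success, -1 if out of memory. */
int record_add_field(Record *r, const char *value)
{
    char *copy;

    if (r->cnt_of_fields == r->capacity) {
        /* double the array so that repeated appends stay cheap */
        unsigned new_cap = r->capacity ? r->capacity * 2 : 4;
        char **grown = realloc(r->field_values, new_cap * sizeof *grown);
        if (grown == NULL)
            return -1;
        r->field_values = grown;
        r->capacity = new_cap;
    }
    copy = malloc(strlen(value) + 1);
    if (copy == NULL)
        return -1;
    strcpy(copy, value);
    r->field_values[r->cnt_of_fields++] = copy;
    return 0;
}

/* Release every field and reset the record to empty. */
void record_free(Record *r)
{
    unsigned i;
    for (i = 0; i < r->cnt_of_fields; ++i)
        free(r->field_values[i]);
    free(r->field_values);
    r->field_values = NULL;
    r->cnt_of_fields = r->capacity = 0;
}
```

A record starts out as { 0, 0, NULL }; field i is then available in constant time as field_values[i].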
  • 0

#4 spadez

spadez

    CC Regular

  • Just Joined
  • PipPipPip
  • 38 posts

Posted 03 February 2009 - 04:25 PM

Hi,

Thank you for the reply.

Your parsing looks great. Since this is only going to parse a small file and time isn't an issue, I could keep the code as is, although if it can be optimised then that would be great.

I'm afraid I don't really understand the "data structure", as I'm very new to this. What exactly is a data structure and how would it be integrated?

P.S. I don't need to edit the file, only read from it.
  • 0

#5 Lance

Lance

    CC Addict

  • Advanced Member
  • PipPipPipPipPip
  • 270 posts

Posted 03 February 2009 - 06:17 PM

If that's the case, I guess you can use a linked list.

OK, you parse the CSV file, but if you cannot store the result of your parsing in one way or another, it would be of little use.

For example, your CSV may be of the format:

Name, hourly_rate, Bi_weekly_hours


You will probably be asked to do certain calculations, searches, etc. based on the data. To avoid having to parse the CSV file every time, you will want to store it in memory. The most intuitive way to interpret a CSV table is as a vector (array) of vectors (arrays) of fields.

OK, in the middle of typing this, I decided it's still most logical to use an array instead of a linked list.

I just typed the following code; it's not tested:

#include <stdio.h>
#include <stdlib.h>  /* malloc, exit */
#include <string.h>  /* memcpy */

#define MAX_LINE_LEN   (1024*512)  /* 1/2 megabyte, should be more than sufficient */
.....

   static char line[MAX_LINE_LEN]; /* static: too large to place on the stack safely */
   int c;
   int len=0;
   int cnt_of_fields=0;
   char *p;
   while( (c=fgetc(fp))!=EOF){
           switch(c)
           {
           case '"':
           case '\'':
                    /* parse a quoted string, ignore it for now */
                  break;
           case ',':
                  ++cnt_of_fields; /* a comma signals the end of the previous field and the beginning of the next */
                  line[len++]='\0';
                  if(len==MAX_LINE_LEN){
                           fprintf(stderr, "Line too long\n");
                           exit(-1);
                  }
                 break;
           case '\n':
                ++cnt_of_fields; /* an EOL ends the record and, at the same time, the last field */
                  line[len++]='\0';

                 /* make a copy of the line on the heap; strdup or strcpy won't work
                  * here because they would stop at the first '\0' */
                p = (char *)malloc(len);
                if(p==NULL){
                         fprintf(stderr, "Out of memory\n");
                         exit(-1);
                }
                memcpy(p, line, len); /* now all the fields in the record are stored in p[ ] */
                add_a_record( p );
                len = 0;              /* start the next record with an empty buffer */
                cnt_of_fields = 0;
                break;
           default:
                line[len++] = c;
                  if(len==MAX_LINE_LEN){
                           fprintf(stderr, "Line too long\n");
                           exit(-1);
                  }
           }
   }



Hope you understand what I mean.
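
The record copied by memcpy above is a run of NUL-terminated fields laid end to end, which is why strdup and strcpy would only see the first field. A minimal sketch of how such a packed record could be walked afterwards follows; split_packed_record is a hypothetical helper, and it assumes the last byte of the buffer is a '\0', as it is in the code above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: `packed` holds `len` bytes made of NUL-terminated
 * fields laid end to end, e.g. "a\0b\0c\0". Stores pointers to the first
 * `max_fields` fields in `out` and returns the total number of fields
 * found in the buffer. */
size_t split_packed_record(char *packed, size_t len, char **out, size_t max_fields)
{
    size_t n = 0;
    size_t pos = 0;

    while (pos < len) {
        if (n < max_fields)
            out[n] = packed + pos;
        ++n;
        pos += strlen(packed + pos) + 1;   /* skip the field and its '\0' */
    }
    return n;
}
```

The pointers refer into the packed buffer itself, so no further copying is needed and one free() of the buffer releases the whole record.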

Edited by Lance, 03 February 2009 - 06:22 PM.

  • 0

#6 Lance

Lance

    CC Addict

  • Advanced Member
  • PipPipPipPipPip
  • 270 posts

Posted 03 February 2009 - 06:20 PM

You still need to find a way to organize the records. I would say that, in the parsing stage, a linked list is a good option. Then, after parsing, you can always change it to an array (as you now know the number of records, it is very easy to allocate the required memory). Then get_value(int record_no, int field_no) would take constant time.
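
A minimal sketch of that two-stage idea, assuming each record is represented by a single char pointer: nodes are pushed onto a list while parsing, and the list is turned into an array once the record count is known. Node, list_push and list_to_array are hypothetical names for this example.

```c
#include <stdlib.h>

typedef struct Node {
    char *record;          /* one parsed record; not owned by the list */
    struct Node *next;
} Node;

/* Push a record onto the front of the list in constant time.
 * Returns the new head, or NULL on allocation failure, in which case
 * the old list is untouched and the caller should keep its old head. */
Node *list_push(Node *head, char *record)
{
    Node *n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    n->record = record;
    n->next = head;
    return n;
}

/* Move `count` records from the list into a newly allocated array and
 * free the nodes. Pushing reversed the order, so the array is filled
 * from the back to restore the order in which records were read. */
char **list_to_array(Node *head, size_t count)
{
    char **arr = malloc((count ? count : 1) * sizeof *arr);
    size_t i = count;

    if (arr == NULL)
        return NULL;
    while (head != NULL && i > 0) {
        Node *next = head->next;
        arr[--i] = head->record;
        free(head);
        head = next;
    }
    return arr;
}
```

After the conversion, record number k is simply arr[k], which is the constant-time lookup mentioned above.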
  • 0

#7 spadez

spadez

    CC Regular

  • Just Joined
  • PipPipPip
  • 38 posts

Posted 04 February 2009 - 07:52 AM

Hi Lance,

Thank you for the info and code. I'm going to read through it today and try to understand why it is coded the way it is, but it's a big help to get me on my way.
  • 0

#8 Lance

Lance

    CC Addict

  • Advanced Member
  • PipPipPipPipPip
  • 270 posts

Posted 14 April 2009 - 10:37 AM

Hi spadez:

You requested assistance in a recent private message. Please try the following code

csv_parser.h
// file: csv_parser.h

#ifndef _CSV_PARSER_H_
#define _CSV_PARSER_H_

#include <stdio.h>  // for fopen, fclose, etc.


#define MAX_LINE_LEN (1024*512)
#define MAX_COLUMN_COUNT 1024
/* digest from CSV wiki: http://en.wikipedia.org/wiki/Comma-separated_values

    * fields that contain commas, double-quotes, or line-breaks must be quoted,
    * a quote within a field must be escaped with an additional quote
      immediately preceding the literal quote,
    * space before and after delimiter commas may be trimmed (which is
      prohibited by RFC 4180), and
    * a line break within an element must be preserved.
*/

enum { E_LINE_TOO_WIDE=-2, // error code for line width >= MAX_LINE_LEN
       E_QUOTED_STRING     // error code for ill-formatted quoted string
};

// mimic sqlite callback interface
//
typedef int (*CSV_CB_record_handler)
(
    void * params,
    int column_cnt,
    const char ** column_values
);

int csv_parse (FILE *fp, CSV_CB_record_handler cb, void *params);

#endif
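
The params pointer in the callback type above lets the caller carry its own state through the parser without global variables. Here is a minimal sketch of such a callback. The names record_handler, struct row_stats and count_rows are hypothetical, and the typedef is repeated only so that the example compiles on its own.

```c
/* Same shape as CSV_CB_record_handler in csv_parser.h. */
typedef int (*record_handler)(void *params, int column_cnt,
                              const char **column_values);

/* Hypothetical caller-side state, passed to the parser as `params`. */
struct row_stats {
    int rows;     /* number of records seen so far */
    int widest;   /* largest column count seen so far */
};

/* Callback: tally rows and remember the largest column count.
 * Returning 0 asks the parser to continue; non-zero would stop it. */
int count_rows(void *params, int column_cnt, const char **column_values)
{
    struct row_stats *s = params;

    (void)column_values;   /* the field values are not needed here */
    ++s->rows;
    if (column_cnt > s->widest)
        s->widest = column_cnt;
    return 0;
}
```

With the parser from this thread, the call would look like csv_parse(fp, count_rows, &stats), where stats is a struct row_stats initialised to zero.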

csv_parser.c
// file: csv_parser.c
//

#include "csv_parser.h"
#include <ctype.h>   // for isspace

// private:
struct csv_parser_data
{
    CSV_CB_record_handler callback;
    void                * params;
    char                  buff[MAX_LINE_LEN];
    int                   field_count;
    const char *          column_values[MAX_COLUMN_COUNT];
};

/* digest from CSV wiki http://en.wikipedia.org/wiki/Comma-separated_values

    * fields that contain commas, double-quotes, or line-breaks must be quoted,
    * a quote within a field must be escaped with an additional quote
      immediately preceding the literal quote,
    * space before and after delimiter commas may be trimmed (which is
      prohibited by RFC 4180), and
    * a line break within an element must be preserved.
*/



// returns 0 on success, E_QUOTED_STRING ON improperly quoted string
//
// pre-condition: **buff points to the beginning quote
// post-condition: **buff points to just before either a comma,
//              or a newline, or E_QUOTED_STRING is returned.
//
static int csv_process_quoted_string(char **buff)
{
    char * q = *buff;
    char * p = q++;
    while (1)
    {
        switch(*q)
        {
        case '\0': // end of line in Quoted String, it's an error
            return E_QUOTED_STRING;
        case '"': // if the next char is not a '"', the QuotedString ends.
            if(*++q!='"')
                goto done_quoted_string;
            // here we deliberately let the else case fall through to default
            // processing
            //
        default:
            *p=*q;
            break;
        }
        ++p, ++q;
    }
done_quoted_string:
    *p='\0';

    while( *q==' ' || *q=='\t' )
        ++q;
    if( *q!=',' && *q!='\n' && *q!='\0')
        return E_QUOTED_STRING;
    *buff=--q;
    return 0;

}

//  returns
//   0 : to continue parse next record
//   non-zero:  abort processing
//       E_QUOTED_STRING is a special case of non-zero return values
//
static int csv_parse_line (struct csv_parser_data * d)
{
    char c;
    char * buff=d->buff;
    d->column_values[0]= buff;
    d->field_count=1;

    // also stop at '\0': the last line of a file may lack a trailing newline
    while ( (c=*buff)!='\n' && c!='\0' )
    {
        switch (c)
        {
        case ',': // mark the end of the current field, and beginning of next field
            if (d->field_count==MAX_COLUMN_COUNT)
                return E_LINE_TOO_WIDE; // no room left in column_values[]
            *buff='\0';
            d->column_values[d->field_count++]=buff+1;
            break;
        case '"': // beginning a quoted string
            if( E_QUOTED_STRING==csv_process_quoted_string(&buff) )
                return E_QUOTED_STRING;
            break;
        //default: no action

        }
        ++buff;
    }
    // now buff points to '\n' or '\0'; terminate the last field
    *buff='\0';

    if (d->callback==NULL) 
        return 0;
    return d->callback (d->params, d->field_count, d->column_values);

}
// returns
//  0: on success
//  E_LINE_TOO_WIDE: on line too wide
//  E_QUOTED_STRING: at least 1 Quoted String is ill-formatted
//
int csv_parse(FILE *fp, CSV_CB_record_handler cb, void *params)
{
    //char buff[MAX_LINE_LEN];
    struct csv_parser_data d;
    
    d.callback = cb;
    d.params   = params;

    while ( d.buff[MAX_LINE_LEN-1]='*',
            NULL!= fgets (d.buff, MAX_LINE_LEN, fp)
    ){
        int r;
        if(d.buff[MAX_LINE_LEN-1]=='\0' && d.buff[MAX_LINE_LEN-2]!='\n')
            return E_LINE_TOO_WIDE;
        r=csv_parse_line (&d);
        if (r==E_QUOTED_STRING || r==E_LINE_TOO_WIDE)
            return r;   // parse error: report it to the caller
        else if (r!=0)
            break;      // the callback asked to stop
    }
    return 0;
}
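
The loop in csv_parse detects over-long lines by planting a non-NUL sentinel in the last byte of the buffer before each fgets call. The sketch below isolates that technique in a hypothetical helper, read_line_checked, which is not part of the posted parser; it assumes size is at least 2.

```c
#include <stdio.h>

/* Read one line into buf[0..size-1]. Returns:
 *    1  a line was read and it fit in the buffer,
 *    0  end of file (or read error),
 *   -1  the line did not fit.
 * Like the original loop, a final line of exactly size-1 characters
 * with no trailing newline is also reported as too long. */
int read_line_checked(FILE *fp, char *buf, int size)
{
    buf[size - 1] = '*';   /* sentinel: any non-NUL value will do */
    if (fgets(buf, size, fp) == NULL)
        return 0;
    /* fgets writes '\0' into the last slot only when it filled the
     * buffer. If the character before it is not '\n', the line was
     * cut short. */
    if (buf[size - 1] == '\0' && buf[size - 2] != '\n')
        return -1;
    return 1;
}
```

The check costs two byte comparisons per line and avoids a strlen call over the whole buffer.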

  • 0

#9 Lance

Lance

    CC Addict

  • Advanced Member
  • PipPipPipPipPip
  • 270 posts

Posted 14 April 2009 - 10:38 AM

test1.c: display content of a csv file
//---------------------------------------------------------------------------
// file: test1.c
//

#include "csv_parser.h"
#include <stdio.h>

// callback function: print one record, fields separated by " | "
//
//
int display_csv(void * dummy, int cnt, const char ** cv)
{
    int i;
    printf("%s", cv[0]);  // never pass file data as the format string
    for(i=1; i<cnt; ++i){
        printf(" | %s",cv[i]);
    }
    printf("\n");
    return 0;
}

int main (int argc, char* argv[])
{
    FILE *fp;
    if (argc != 2)
    {
        printf("Usage: %s filename.csv\n", argv[0]);
        return 1;
    }

    if ( NULL==(fp=fopen (argv[1],"r") ) )
    {
        fprintf (stderr, "Cannot open input file: %s\n", argv[1]);
        return 2;
    }
    switch( csv_parse (fp, display_csv, NULL) )
    {
    case E_LINE_TOO_WIDE:
        fprintf(stderr,"Error parsing csv: line too wide.\n");
        break;
    case E_QUOTED_STRING:
        fprintf(stderr,"Error parsing csv: ill-formatted quoted string.\n");
        break;
    }

    fclose (fp);
    return 0;
}

  • 0

#10 Lance

Lance

    CC Addict

  • Advanced Member
  • PipPipPipPipPip
  • 270 posts

Posted 14 April 2009 - 10:40 AM

test2.c relies on the following data file

sales.csv
White,Johnson,408 496-7223,5335.5,"At a dinner party, one should eat wisely but not too well, and talk well but"
Green,Marjorie,415 986-7020,5226.4,"""The minute you settle for less than you deserve, you get even less than you settled for."""
Carson,Cheryl,415 548-7723,13265.5,"Ah, but a man's reach should exceed his grasp, Or what's a heaven for?"
O'Leary,Michael,408 286-2428,771.68,And there is even a happiness that makes the heart afraid
Straight,Dean,415 834-2919,3235.8,I like to see a man proud of the place in which he lives. I like to see a man live so that his place will be proud of him.
Smith,Meander,913 843-0462,4315.75,"The struggle against war, properly understood and executed, presupposes the uncompromising hostility of the proletariat and its organizations, always and everywhere, toward its own and every other imperialist bourgeoisie..."
Bennet,Abraham,415 658-9932,665.77,The best sauce for food is hunger
Dull,Ann,415 836-7128,998.63,Don't worry that children never listen to you. Worry that they are always watching you.
Gringlesby,Burt,707 938-6445,3314.12,No rule of success will work if you don't
Locksley,Charlene,415 585-4620,667.95,"Let me say to you that to do nothing at all is the most difficult thing in the world, and the most intellectual"
Greene,Morningstar,615 297-2723,23445.6,"The Lord made Adam, the Lord made Eve, he made em both a little bit naive."
Blotchet-Halls,Reginald,503 745-6402,779.73,"Any jackass can kick down a barn, but it takes a good carpenter to build one"
Yokomoto,Akiko,415 935-4228,3381.45,The secret of success in life is known only to those who have not succeeded
del Castillo,Innes,615 996-8275,23110.18,"It is said an Eastern monarch once charged his wise men to invent him a sentence to be ever in view, and which should be true and appropriate in all times and situations. They presented him the words: 'And this, too, shall pass away.' How much it expresses! How chastening in the hour of pride! How consoling in the depths of affliction!"
DeFrance,Michel,219 547-9982,779.64,The least pain in our little finger gives us more concern and uneasiness than the destruction of millions of our fellow-beings
Stringer,Dirk,415 843-2991,993.14,"If we don't end war, war will end us."
MacFeather,Stearns,415 354-7128,886.17,"If ignorance is bliss, why aren't there more happy people ?"
Karsen,Livia,415 534-9219,3345.15,"When men and women agree, it is only in their conclusions; their reasons are always different"
Panteley,Sylvia,301 946-8853,776.48,Delay of justice is injustice
Hunter,Sheryl,415 836-7128,5594.33,We have seen the enemy and he is us.
McBadden,Heather,707 448-4982,3817.18,
"Rin""ger",Anne,801 826-0752,4465.13,
Ringer,Albert,801 826-0752,5559.95,

test2.c
//---------------------------------------------------------------------------
// file: test2.c
//

#include "csv_parser.h"
#include <stdio.h>  // for fopen, fclose, etc.
#include <stdlib.h> // for atof;

struct process_sales_data
{
    float min;
    float max;
    float sum;
    int record_count;
    char min_employee_name[50];
    char max_employee_name[50];
    char min_employee_motto[100];
    char max_employee_motto[100];
};

// column[0]: last name
//        1 : first name
//        2 : telephone
//        3 : sales
//        4 : motto
//
int process_sales(void * data, int cnt, const char ** cv)
{
    struct process_sales_data *d=(struct process_sales_data*)data;
    float sales;

    if (cnt < 5)  // skip short records: cv[3] or cv[4] would not exist
        return 0;
    sales=(float)atof(cv[3]);
    ++d->record_count;
    d->sum += sales;
    if( sales > d->max){
        d->max = sales;
        snprintf(d->max_employee_name,49,"%s %s", cv[1], cv[0]);
        snprintf(d->max_employee_motto, 99, "%s", cv[4]);
    }
    if(sales < d->min){
        d->min = sales;
        snprintf(d->min_employee_name,49,"%s %s", cv[1], cv[0]);
        snprintf(d->min_employee_motto, 99, "%s", cv[4]);
    }                  

    return 0;
}

int main ()
{
    FILE *fp;
    struct process_sales_data d = {0};  // also zeroes the name and motto buffers

    d.min=1E12;
    d.max=0.f;
    d.sum=0.f;
    d.record_count=0;
    //if (argc != 2)
    //{
    //    printf("Usage: %s filename.csv\n", argv[0]);
    //    return 1;
    //}

    if ( NULL==(fp=fopen ("sales.csv","r") ) )
    {
        fprintf (stderr, "Cannot open input file sales.csv\n");
        return 2;
    }
    switch( csv_parse (fp, process_sales, &d) )
    {
    case E_LINE_TOO_WIDE:
        fprintf(stderr,"Error parsing csv: line too wide.\n");
        break;
    case E_QUOTED_STRING:
        fprintf(stderr,"Error parsing csv: ill-formatted quoted string.\n");
        break;
    }
    
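    fclose (fp);

    if (d.record_count == 0) {  // avoid dividing by zero below
        fprintf(stderr, "No records were read from sales.csv\n");
        return 3;
    }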

    printf("%s has the maximum sales at %8.2f, his/her motto is: %s\n",
        d.max_employee_name, d.max, d.max_employee_motto);
    printf("%s has the minimum sales at %8.2f, his/her motto is: %s\n",
        d.min_employee_name, d.min, d.min_employee_motto);
    printf("Total sales is %10.2f, number of salesman is %d, average sales is %8.2f\n",
        d.sum, d.record_count, d.sum/d.record_count);

    return 0;
}

Let me know if you have any problems compiling, running or understanding them.
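
One possible refinement to test2.c: atof cannot report a malformed number, so a bad value in the sales column silently counts as 0. A minimal sketch of a stricter conversion using the standard strtod function follows; parse_amount is a hypothetical helper, not part of the posted code.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: convert `text` to a double.
 * Returns 0 and stores the value in *out on success, or -1 if the text
 * is empty, has trailing characters after the number, or is out of range. */
int parse_amount(const char *text, double *out)
{
    char *end;
    double v;

    errno = 0;
    v = strtod(text, &end);
    if (end == text || *end != '\0' || errno == ERANGE)
        return -1;
    *out = v;
    return 0;
}
```

In process_sales, a record whose cv[3] fails this check could be skipped or reported instead of being added to the totals.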
  • 0




