Jump to content

More of a Math Problem

- - - - -

  • Please log in to reply
3 replies to this topic

#1
Zer033

Zer033

    Learning Programmer

  • Members
  • PipPipPip
  • 79 posts
Okay I have a scenario I want some input on please. I may create a C++ program involving this scenario in some form plus good programmers are usually also good at algorithms and math so I figure it might be good to ask here.

There are 6 mailings (1, 2, 3, 4, 5, 6). What I mean by a mailing is that each consists of X number of names from multiple mailing lists (6 mailing lists in total to represent each specific mailing). They can borrow names from each other. These names on the lists will be merged and purged against each other (some names can be duplicates) in a priority order where the specific mailing it is has the highest priority. So if it is a "1" mailing it might have "2", "3", "4" parts in it as well, but any names that are in "1" and "2" or "1" and "3" or "1" and "4" will be mailed as a "1" Title and those names will not be included for the other Titles.

The names will be selected by a person based on good performance. So "1" might have a selected 10,000 names, "2" might have a selected 5,000 names, "3" might have a selected 3,000 names, and "4" might have a selected 1,000 names. However, the number of names mailed will not be the sum of those numbers because some of the names will be duplicates. So after the merging and purging of the names "1" will survive at a high rate 99.9% or so while "2", "3", and "4" survive less.

The point of all that was just to get you to understand the set up. The question I want to ask is coming up.

These mailings get mailed and the people buy things from them. Each day a certain number of names are put into their own list, the buyers list. This buyers list is broken into 3 categories; they can be called A, B and C. The buyers list is always included in a mailing so that the people that just bought will be mailed again.

The number of names on the buyers list needs to be predicted about 7-14 days prior to actually mailing so that a number can be estimated for the total names in the entire mailing. As the day to finalize the mailing approaches the closer this prediction is the better.

The buyer's list heading on the mailing will have a date range, for example 06/01/2010-06/18/2010. This means that the heading on the mailing represents all the people that bought from each of the 6 mailings defined above in that span. The number of names before any kind of merging or purging has to be predicted.

As before this buyer's list heading is selected by performance. You can select from each of the categories (A, B, C) and you can select within each of those categories dollar ranges ($50+, $40+, $30+, $20+, $10+). The dollar ranges mean those people selected in that square bought a $50+ item, $40+ item, etc.

The data given is:

1.) About 5 months worth of actual daily numbers of how many buyers per day in each category.
2.) Actual selections by $ amount for 5 months worth of date ranges.

The things that have to be predicted are:

1.) How to predict how many buyers will be added to the list per day.
2.) How many people the selection made will be.

#1 is predicted already, so you can assume you have access to that predicted number already (but if you have a really good way to predict that then please say).

#2 is the main prediction we want to do. The way I do it relies on the accuracy of #1 though.


So here is how I did it:

I took the 5 month history of daily buyers which was broken down like this for each mailing:

Mailing 1

        Category A    Category B    Category C    Total
Day1    some #        some #        some #        some #
Day2    some #        some #        some #        some #
Day3    some #        some #        some #        some #
Day4    some #        some #        some #        some #
Day5    some #        some #        some #        some #
Day6    some #        some #        some #        some #
DayN    some #        some #        some #        some #
and figured out how much of a % each mailing was from the total for each day.

So the table would look like this:

    
            Category A    Category B    Category C    Total
Day1        some %        some %        some %        some %
Day2        some %        some %        some %        some %
Day3        some %        some %        some %        some %
Day4        some %        some %        some %        some %
Day5        some %        some %        some %        some %
Day6        some %        some %        some %        some %
I did that by dividing the Category A, B, or C number by the Total for that day. The Total % was obtained by adding all the Totals for the day from each of the mailings (all 6) and dividing each mailing's Total # by that combined total of all the mailings.

Next I went through the last 5 months of old mailings and looked at their buyer's list span. I went back to the actual totals (1st table above) and calculated the total from each mailing for each span. So what I ended up with was:

In the February mailing, for example, the span might be 02/01/2010-02/20/2010. I added all the "1" buyers, "2" buyers, "3" buyers, "4" buyers, "5" buyers, "6" buyers. and ended up with:

1= some #
2= some #
3= some #
4= some #
5= some #
5= some #

I did this for 5 months worth of mailings. That is 1 mailing per month of each type. So 5 "1"s, 5 "2"s, 5 "3"s, 5 "4"s, 5 "5"s, 5 "6"s.

Next, I got the total of each of the selections for each category and $ range. The table would look like this:

            mailing1    mailing2    mailing3    mailing4    mailing5    mailing6    
Category A
$50+        
$40+
$30+
$20+
$10+
Category B
$50+
$40+
$30+
$20+
$10+
Category C
$50+
$40+
$30+
$20+
$10+
Once I populated all the tables for each month (30 tables in all) I divided each of the numbers in each of the tables by their respective Total mailing # (the one in the February example). This gave the % that each Category and $ combo is from the total selection for the buyer list time span per month.

Finally, I averaged the Total %'s from the second table above (5 months worth) so I could use this % later in estimating. This would represent the % that each mailing # is of the total time span from the buyer's list.

I also averaged each of the 30 table's corresponding mailing columns. So for instance in mailing X I had 5 tables (one for each month). Those 5 tables each had 6 "mailing #" headings which meant that names from each of those mailing's lists were used in that month's mailing X. So for mailing X I averaged 5 months worth of mailing 1 category A $50+, mailing 1 category A $40+, etc. I did this until I populated a percentage table of averaged percents. I would later use this table to estimate the number of names within some specific mailing/category/$ combo.


Once I had these averages I looked on the predicted report of buyer's per each day from the the #1 predictions that need to be made heading above. I added these names together to get some total # for some specific span of time.

I multiplied that number by the first average % I described 3 paragraphs ago. This gave me a breakdown of how the divisions were divided up in just one mailing X:

mailing 1: some # obtained by the average of all days mailing 1 total %'s in table 2
mailing 2: some # obtained by the average of all days mailing 2 total %'s in table 2
mailing 3: some # obtained by the average of all days mailing 3 total %'s in table 2
mailing 4: some # obtained by the average of all days mailing 4 total %'s in table 2
mailing 5: some # obtained by the average of all days mailing 5 total %'s in table 2
mailing 6: some # obtained by the average of all days mailing 6 total %'s in table 2

The next step was to further divide these mailings estimated numbers by category and $ amount.

Depending on the selection (all dollar ranges or categories don't have to be actually selected) the sum will change.

So by saying I take the selection of mailing 1, category A, $50+ dollar amount the cell for that would be populated with the result of:

"mailing 1: some # obtained by the average of all days mailing 1 total %'s in table 2" * the averaged percent

This gets the total of people in the mailing 1 selection that are in category A and bought a $50+ item.


If this was at all possible to follow the things I'm wondering are:

1.) What would be the best way to estimate how many buyer's per day? As of now I'm using an already estimated report that I'm not sure who creates. It is off by a few thousand per day usually.

2.) Not related to actually estimating the number of buyer's per day, but instead estimating how much our selection will be (mailing #/category/$ amount), given some already obtained buyer per day numbers, is averaging the %'s over a 5 month span an okay way to go about that? Or is there a better way.

#2
WingedPanther

WingedPanther

    A spammer's worst nightmare

  • Moderators
  • 16,831 posts
  • Location:Upstate, South Carolina
  • Programming Language:C, C++, PL/SQL, Delphi/Object Pascal, Pascal, Transact-SQL, Others
  • Learning:Java, C#, PHP, JavaScript, Lisp, Fortran, Haskell, Others
First issue: Saying an estimate is off by a few thousand per day isn't meaningful. What percent error is that?
Second issue: Any prediction you make will have an associated uncertainty. Depending on how many rounds of mailings you have data for, this uncertainty may be very high, no matter what method you use, resulting in a high level of expected error in your prediction.
Third issue: It's not clear to me what statistical model you think is appropriate for making your prediction. Since we don't really have any data to examine, there's no way to tell how the data should be modeled and, thus, how to make a prediction. Given the wide variety of probability distributions, and various techniques for optimizing each of them to create a best fit to the data you have, you are looking at a non-trivial task.

Generally, what you're being asked would require you to have familiarity with advanced (calculus-based) statistics. Unless you have access to research that's already been done (and probably wasn't published for competitive reasons), you may have to reinvent the wheel.
Programming is a branch of mathematics.
My CodeCall Blog | My Personal Blog

#3
Zer033

Zer033

    Learning Programmer

  • Members
  • PipPipPip
  • 79 posts
I'll have to look at the % error in that buyer's list prediction. I didn't have access to more than a month or so of past predictions so I couldn't look into that. I mainly was focused on the estimate after using that initial prediction report, so basically I was assuming that original prediction report was good as is (although it's not so I want to look at it at some point).

As for the prediction data I used, I chose 5 months because it seemed like a long enough span, but it wasn't so long of a span that seasonality or even performance that was too old skewed it too much. I planned to always update the predictions by the latest 5 months so it would be new data instead of actually adding a month onto the prediction each month. So I was going to replace the oldest month of data with the newest month with each passing month so that data changed with the times since the sales won't always be the same each and every month, but they may be similar within a given 5 month span whereas they might not be anything alike if you were to take month 1 and month 12 of a 12 month span.

I've primarily been just averaging these large amounts of data to get my estimates. I'm wondering if there's a better way than simply averaging. I've looked a bit into standard deviation and standard error, but I'm not entirely sure how to apply that correctly.

#4
WingedPanther

WingedPanther

    A spammer's worst nightmare

  • Moderators
  • 16,831 posts
  • Location:Upstate, South Carolina
  • Programming Language:C, C++, PL/SQL, Delphi/Object Pascal, Pascal, Transact-SQL, Others
  • Learning:Java, C#, PHP, JavaScript, Lisp, Fortran, Haskell, Others
http://www.causascie.../Compendium.pdf
http://www.vosesoftw...ntent/ebook.pdf
Statistical Distributions - Overview
Equations: Probability Distributions - SITMO
Programming is a branch of mathematics.
My CodeCall Blog | My Personal Blog




1 user(s) are reading this topic

0 members, 1 guests, 0 anonymous users