Hi gurus,
There is a property advertisement website that generates a description for each property automatically. The application selects one of several templates and fills it with words that come from various tables. It has to produce descriptions for about 1000 properties on average at each run.
We have 250,000 properties in the DB, and there is a chance that under rare circumstances we re-generate all the existing descriptions.
The application uses a C# implementation of the Shingle Analysis algorithm. It compares the new sentence with the existing sentences and calculates a rank, then selects the template that has the lowest similarity.
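For readers unfamiliar with shingling, here is a minimal sketch of what such a similarity rank might look like. It is written in Python for brevity and is not the C# code in question. It assumes word-level shingles of size k and the Jaccard resemblance of the two shingle sets; the function names and the default k=2 are illustrative only.

```python
def shingles(text, k=2):
    """Return the set of k-word shingles (contiguous word runs) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Too short for a full shingle: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity(a, b, k=2):
    """Jaccard resemblance of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)
```

With this definition, two identical sentences score 1.0 and two sentences that share no word pair score 0.0.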
When the application runs, it retrieves the IDs of the properties, then selects a template and fills it with data. Each template tag might be filled with different words:
The house has <tag1> and <tag2>.
The Big Boss wants me to replace the above tags with random words each time a description is generated, such as:
1- The house has Pool and Tennis Court.
2- The house has Fireplace and a nice view.
As you can see, the ranks for a template might be different each time it is generated, so we cannot just store the ranks.
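To make the point concrete, here is a rough Python sketch of the random tag substitution. The function name and the word lists are hypothetical; in the real application the words come from database tables.

```python
import random


def fill_template(template, words_by_tag, rng=random):
    """Replace each <tag> in template with a randomly chosen word."""
    sentence = template
    for tag, words in words_by_tag.items():
        sentence = sentence.replace(f"<{tag}>", rng.choice(words))
    return sentence


template = "The house has <tag1> and <tag2>."
# Hypothetical word lists; the real ones come from the application's tables.
words_by_tag = {
    "tag1": ["a pool", "a fireplace"],
    "tag2": ["a tennis court", "a nice view"],
}
first = fill_template(template, words_by_tag)
second = fill_template(template, words_by_tag)
```

Here first and second come from the same template but may be different sentences, so a rank computed for one does not carry over to the other.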
I have to say that I’ve found the cause of the problem, and I think I need to find a better approach rather than just optimizing the Shingle Analysis code. Here is the problem:
1- The code gets a template from the available template options (there are several templates and the app chooses one randomly).
2- The code fills the template.
3- The code loads all the previously generated sentences, compares the new sentence with all of them, calculates the ranks and generates a list of ranks.
4- The code selects the template that has the lowest rank (if it is not the current template).
5- The code re-generates the sentence with the new template.
6- The code stores the generated sentence.
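The steps above could be sketched roughly as follows. This is only one simplified reading of the process, in Python rather than C#: it fills every template instead of starting from a random one, and it assumes the rank of a sentence is its highest shingle similarity to any stored sentence. All names here are hypothetical.

```python
def shingles(text, k=2):
    """Set of k-word shingles in text (assumed shingle definition)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def similarity(a, b, k=2):
    """Jaccard resemblance of the two shingle sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


def rank(sentence, stored):
    """Assumed rank: highest similarity to any previously stored sentence."""
    return max((similarity(sentence, s) for s in stored), default=0.0)


def describe(templates, fill, stored):
    """Generate and store one description; fill(template) returns a sentence."""
    candidates = [fill(t) for t in templates]       # steps 1-2, for every template
    ranks = [rank(c, stored) for c in candidates]   # step 3: scans ALL stored sentences
    best = candidates[ranks.index(min(ranks))]      # steps 4-5: keep the lowest rank
    stored.append(best)                             # step 6
    return best
```

The call to rank in step 3 walks the whole stored list, and that list grows by one sentence per property.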
The above process happens for every property, so after processing 999 properties we will have 999 long descriptions in the DB. Therefore, for the 1000th property, the new sentence is compared with the descriptions of the previous 999 properties, which is very time-consuming.
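A quick back-of-the-envelope count shows how this adds up. The sketch below assumes one candidate sentence per property, compared against every sentence stored so far in the run; if several templates are ranked per property, multiply accordingly.

```python
def comparisons_in_run(n):
    """Pairwise comparisons when property i is checked against the i earlier ones."""
    return sum(range(n))  # 0 + 1 + ... + (n - 1) = n * (n - 1) / 2


print(comparisons_in_run(1000))     # 499500
print(comparisons_in_run(250_000))  # 31249875000
```

So the work grows quadratically with the number of properties in a run, not linearly.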
When I start generating the sentences for 1000 properties, it is very quick at the beginning, but the more properties are processed, the slower the application becomes. Can you think of a solution to get rid of this massive string? I believe that even if I made the Shingle code 5 times faster, the whole process would still be very slow.
Thanks a lot in advance,
Performance of Shingle Analysis (for comparing strings)
Started by PersianAussie, Aug 14 2011 04:22 PM
1 reply to this topic
#1
Posted 14 August 2011 - 04:22 PM
#2
Posted 15 August 2011 - 04:50 AM
Quote
We have 250,000 properties in the DB, and there is a chance that under rare circumstances we re-generate all the existing descriptions.