

Member Since 30 Jan 2009
Offline Last Active May 20 2012 11:02 PM

#630138 Building Web Page Clustering Application, Need Help

Posted by jwxie518 on 20 May 2012 - 10:43 PM

You said the crawler might end up visiting two threads (one with 1 reply, one with 30 replies). If both links are of the form `http://codecall.net/topic/xxxxxxxxx`, you can analyze just one of them and ignore the rest.

This leads to my question: what's the purpose of this application? From a human perspective, it is very unlikely that `http://codecall.net/forum/xxxxxx` uses the same template as `http://codecall.net/topic/xxxx`. So, as the author of the machine: the first time it encounters a URL of the form `/topic/xxxx`, put it in the queue to analyze. Learn it. If your machine then finds another path of exactly that form, don't put it in the analyzer queue; instead, just mark it "walk-through only", that is, use it to find links only.
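
A minimal sketch of that queueing idea, assuming the first path segment names the page type and everything after it is a per-page id. The names `url_shape` and `schedule` are made up for illustration, and a real site may need a smarter rule for what counts as "the same form".

```python
from collections import deque
from urllib.parse import urlparse


def url_shape(url):
    """Reduce a URL to its 'form': host plus path with the variable parts masked."""
    parts = urlparse(url)
    segments = [s for s in parts.path.split("/") if s]
    # Assumption: the first segment is the page type (topic, forum, ...),
    # and every later segment is an identifier that varies per page.
    masked = [segments[0]] + ["*"] * (len(segments) - 1) if segments else []
    return parts.netloc + "/" + "/".join(masked)


def schedule(urls):
    """Split URLs into an analyze queue (first of each form) and a walk-through list."""
    seen_shapes = set()
    analyze_queue = deque()
    walk_only = []
    for url in urls:
        shape = url_shape(url)
        if shape not in seen_shapes:
            # First time we meet this form: learn its template.
            seen_shapes.add(shape)
            analyze_queue.append(url)
        else:
            # Same form as something already learned: only harvest links from it.
            walk_only.append(url)
    return analyze_queue, walk_only
```

With this rule, `url_shape("http://codecall.net/topic/111")` gives `codecall.net/topic/*`, so a second `/topic/...` URL lands in the walk-through list while the first `/forum/...` URL still gets analyzed.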

Now, if you have worked with sites built on something like Drupal, much of their content is based on a few global templates (like a Page template), even though the URLs may look different. That's the part you need to analyze. If the links are of the same format, you should assume they use the same template. Or you can randomly choose a few of them and analyze those. If your machine learns that the ratio of differences is high, alert the learning state and say "hey bro, this URL may actually use different templates based on some kind of preference": maybe by request method (GET, POST, PUT, DELETE), by language, by JavaScript / non-JavaScript, etc.
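
The random spot check could look roughly like this. It is only a sketch: it assumes each page of one URL form has already been reduced to a list of tag names, and `template_drift`, `sample_size`, and the `0.5` threshold are hypothetical choices, not something measured.

```python
import random
from difflib import SequenceMatcher
from itertools import combinations


def template_drift(signatures, sample_size=5, threshold=0.5, rng=None):
    """Sample a few pages of one URL form and measure how much they differ.

    signatures: list of tag-name lists, one per page.
    Returns (average difference between 0.0 and 1.0, alert flag).
    """
    rng = rng or random.Random()
    sample = rng.sample(signatures, min(sample_size, len(signatures)))
    pairs = list(combinations(sample, 2))
    if not pairs:
        # Fewer than two pages: nothing to compare yet.
        return 0.0, False
    diffs = [
        1.0 - SequenceMatcher(None, a, b, autojunk=False).ratio()
        for a, b in pairs
    ]
    drift = sum(diffs) / len(diffs)
    # A high average difference hints that one URL form hides several templates.
    return drift, drift > threshold
```

If the alert fires, that URL form goes back into the analyzer queue instead of being treated as "walk-through only".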

That's the way I see it... I also wonder whether compressing the two HTML files down to their barebone structures (just the tags and maybe names, definitely not any content strings) would help at all. That might be the first step toward a comparison.
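
Here is one way the "barebone structure" idea might be sketched with only the standard library. It keeps tag names and attribute names and drops every content string. `SkeletonParser`, `skeleton`, and `structure_similarity` are illustrative names, and using `difflib` for the comparison is my assumption, not a tested recommendation.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class SkeletonParser(HTMLParser):
    """Collect (tag, attribute names) pairs and ignore all text content."""

    def __init__(self):
        super().__init__()
        self.skeleton = []

    def handle_starttag(self, tag, attrs):
        # Keep attribute names only; values such as ids and URLs vary per page.
        attr_names = tuple(sorted(name for name, _ in attrs))
        self.skeleton.append((tag, attr_names))


def skeleton(html):
    """Return the barebone structure of an HTML string."""
    parser = SkeletonParser()
    parser.feed(html)
    parser.close()
    return parser.skeleton


def structure_similarity(html_a, html_b):
    """Ratio in [0, 1]; 1.0 means the two skeletons are identical."""
    matcher = SequenceMatcher(None, skeleton(html_a), skeleton(html_b), autojunk=False)
    return matcher.ratio()
```

Two thread pages with the same markup but different post text have identical skeletons, so the comparison only sees the structure.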

#630090 Building Web Page Clustering Application, Need Help

Posted by jwxie518 on 20 May 2012 - 01:25 AM

Comparing HTML tags is not going to help you, and comparing whole HTML pages is extremely slow. Just check out the page source of CodeCall (right here, check it out). The file is huge and contains many strings.

You need to set up some criteria and, based on them, decide whether two pages use the same template.
Remember, nowadays we don't make one big template. We "include" or "extend" other templates. You also can't remove all strings from the page source, because some of the content can provide useful information.

Why not analyze the links, the URLs? Make them into a graph and consider the weights and the distances.
I mean, if you exclude certain HTML elements and the post contents from your search (you know for certain that some pages contain dynamic information, such as a forum thread page) and just concentrate on links such as the URLs of button images, avatars, etc., then you can say the pages very likely come from the same group of templates.
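
As a rough illustration of concentrating on links only, the sketch below pulls asset URLs out of a page and compares two pages by how many they share. It uses plain Jaccard overlap as a stand-in for the weighted graph idea, and the choice of `img`, `link`, and `script` tags is an assumption. `AssetLinkParser`, `asset_links`, and `jaccard` are made-up names.

```python
from html.parser import HTMLParser


class AssetLinkParser(HTMLParser):
    """Collect URLs of template-level assets (images, stylesheets, scripts)."""

    # Assumption: these tag/attribute pairs are the ones a template tends to fix.
    ASSET_ATTRS = {"img": "src", "link": "href", "script": "src"}

    def __init__(self):
        super().__init__()
        self.assets = set()

    def handle_starttag(self, tag, attrs):
        wanted = self.ASSET_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.assets.add(value)


def asset_links(html):
    """Return the set of asset URLs referenced by an HTML string."""
    parser = AssetLinkParser()
    parser.feed(html)
    parser.close()
    return parser.assets


def jaccard(a, b):
    """Shared items divided by all items; 0.0 when there is nothing to compare."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)
```

Pages that share most of their button images and stylesheets score close to 1.0 even when the post text is completely different.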

Also, just because the HTML source of two pages differs by 80% does not mean they are not based on the same template.

I can have this in a template:
{% load data based on url %}    // RESTful API .....

From that one template I can generate a page of tables or a page of lists. They don't show the same structure.

So you will definitely miss some of them.

#574931 I need your help, CodeCallers!

Posted by jwxie518 on 28 September 2010 - 09:02 PM

I did too :)

#482323 C and C++ suck big time

Posted by jwxie518 on 10 July 2009 - 09:38 AM

(this is my second time writing this cuz I closed my browser by accident)

I am only a C++ learner. I am not like Panther or Math, who have real-world experience and (obviously) know more than what I know from discussions and books.

First of all, technology does not come from one single language. There are many, many programming languages (let us count PHP and JSP too... forget about those fancy names). Each language has its pros and cons. In real-world programming, no single language rules. A few may hold a larger share of the market, but that does not mean any of them rules the ENTIRE FIELD.

You need to choose the right one for your application. It depends on the type of application, the needs of the application (that includes both the user and the machine), the cost and time budget of the application, and the programmer as an individual. Soon I will recommend two threads (actually it's 3) that will help you understand more about "choosing the right one" instead of comparing "high" and "low" level languages.

It is totally unfair to say that C and C++ suck big time. It only reflects bias, or not knowing "computer science" (a fancy name for programming) well enough (you don't need to be a CS major). C and C++ are old, but there are older languages, like BASIC. Most of the very old languages are very outdated and not used anymore, but that does not mean they made no contribution to our modern "high level languages".

Java is a mainstream modern "high" level language, commonly used and learned today, but it was influenced by C and C++ (and a few others). These "influences" are the building blocks of Java. This is how things work. Everything comes from one idea (that's computing) and evolves over time. So it is not right to say that C and C++ have no meaning anymore.

Still, an old language like Perl is used today, just not as widely as others. New releases continue, and it's up to Perl 6 already. Same thing with PHP. PHP is considered a very old server-side language, some 14 years old already. But it has become so popular these days that people use it more (also because it's free).

Back to my point. As Math said, most of what you have today is programmed mainly in C/C++. That said, you could replace it with higher-level languages, but the performance may not be as efficient as with C/C++. I don't have any data to support this. But think about it: people are not dumb. They use certain languages for certain things because they think those are better compared to the other choices.

Java is not a new language anyway, if you have read the history of Java. The idea was launched a few years before PHP was born. Programming languages come out after a number of years of development. Ruby wasn't really that new either. It was conceived around the same time as Java and PHP. So what is new??

Higher-level programming languages: I can't give my own opinion because I am still a learner with no real experience. But I will redirect you to the proper threads and you will see what people say about them.

Games and OSes are still programmed in C and C++. As someone said, "ASM and C still rule embedded systems", so what do you think? Maybe we can punch in a higher-level language, but it won't work as well as the lower one. If you need full control, you need assembly.

There is not a single one that can rule over another. A language only rules when it comes to a specific need. If you still think that a higher-level language means "more modern" and "better", that's a totally wrong idea about programming languages. As we see, these higher-level languages were developed some 10-20 years ago, so none of them is really modern. The programming language field always moves at its own pace. You cannot remove C and C++ at all. Assembly language is way older than everything I have mentioned here (including BASIC, Perl). You should read the wiki and learn about what it is.

These programming languages can be put into groups. Read here: Programming paradigm - Wikipedia, the free encyclopedia, and you will see that each group has its own pros and cons. You need certain groups for certain cases. Unix is an operating system, but it is still used today as the base of Mac OS X. It is not a language, but the idea is old already. Without BASIC, MS would not have had people creating useful things and having an impact on OSes and on how to program computers. VB replaced BASIC and continues to see major use.

If by higher level you mean languages like C# and Java, then one could argue that C# is just a garbage collection of pieces from Java and C++. That could be true in certain ways, but it very much depends on the person. It does not mean that C# sucks to the bottom, because C# has great advantages that others like C++ lack. If you are a real and responsible programmer, you will compare languages and think thoroughly before choosing which PL, or which combination, to go with. Also, every piece of hardware and software could be programmed in a different language, and there can be friction that prevents a language from being used if its efficiency and compatibility are no good.

By the way, one of the oldest programming languages, COBOL, is still widely used in BUSINESS. So just so you know: it's very old and very "low" according to your standard, and it's still used today. (It isn't low anyway; it was considered high level for its time.) The 2002 standard added more modern features, including object orientation.

Last but not least, big modern software companies require software engineers to know C/C++, Java or Python (though Java is already a must these days). These are just the standards posted by Google, MS, Apple, Adobe. You need to know them, but you don't need to use all of them for everything. Google Engine (the entire Google Platform) was built with several languages, including C, C++, Python, Java and PHP. You see that in the real world no language "rules"; each language covers for the shortcomings of the others. One contributes its advantages where another falls short.

While JSP is very good, we say it's very expensive compared to PHP (this is also a reason why ASP lost market share). We might say Java's compatibility is bad, and that could be very true too.

If you think that higher level means "better, way better", that is totally wrong and will do your future studies no good, because it will only degrade your skills. A new C++ standard is on its way, as I read in the threads.

Older languages may have stopped development. But even one of the oldest, COBOL, is still used today because it adopted more modern features. C++ really drove home to people that object orientation is the way to go.

Please correct me if I made any mistakes here. I'd like to correct myself.

Here are threads you should read:
