jwxie518

Member Since 30 Jan 2009
Offline Last Active May 20 2012 11:02 PM
-----

Posts I've Made

In Topic: Building Web Page Clustering Application, Need Help

20 May 2012 - 10:43 PM

You said you might end up visiting two threads (one with 1 reply, one with 30 replies). If both links have the form `http://codecall.net/topic/xxxxxxxxx`, you can analyze just one of them and ignore the rest.

This leads to my question: what is the purpose of this application? From a human perspective, it is very unlikely that `http://codecall.net/forum/xxxxxx` uses the same template as `http://codecall.net/topic/xxxx`. So, as the author of the machine: the first time it encounters a URL of the form `/topic/xxxx`, put it in the queue to analyze. Learn it. If your machine then finds another path of exactly that form, don't put it in the analyzer queue; instead, mark it "walk-through only", that is, only collect links from it.
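As a rough sketch of that queueing idea (not a definitive implementation): `url_pattern` and `CrawlPlanner` are hypothetical names, and the rule that the first path segment identifies the page type is an assumption that happens to fit URLs like `/topic/xxxx`. The IDs in the usage lines are made up.

```python
from urllib.parse import urlparse


def url_pattern(url):
    """Reduce a URL to the shape of its path, e.g. /topic/111-a -> /topic/*."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if not segments:
        return "/"
    # Assumption: the first segment names the page type, the rest are IDs/slugs.
    return "/" + segments[0] + "/*" * (len(segments) - 1)


class CrawlPlanner:
    """Analyze the first URL of each form; later ones are only walked for links."""

    def __init__(self):
        self.seen_patterns = set()
        self.analyze_queue = []
        self.walk_only = []

    def add(self, url):
        pattern = url_pattern(url)
        if pattern in self.seen_patterns:
            self.walk_only.append(url)
            return "walk-through only"
        self.seen_patterns.add(pattern)
        self.analyze_queue.append(url)
        return "analyze"


planner = CrawlPlanner()
first = planner.add("http://codecall.net/topic/111-a")
second = planner.add("http://codecall.net/topic/222-b")
other = planner.add("http://codecall.net/forum/5-x")
```

Here `first` and `other` land in the analyze queue because their forms are new, while `second` is only walked for links.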

Now, if you have worked with Drupal-based sites, much of their content is actually based on a few global templates (like a Page template), yet the URLs may look different. That's the part you need to analyze. If the links share the same format, you should assume they use the same template, or you can randomly pick a few of them and analyze those. If your machine finds that the ratio of differences is high, flag it in the learning state and say "hey bro, this URL may actually use different templates based on some kind of preference": maybe the HTTP method (GET, POST, PUT, DELETE), the language, JavaScript versus non-JavaScript, etc.
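A minimal sketch of the sample-and-alert step, assuming each analyzed page has already been reduced to some comparable "signature" string (how you compute it is up to you). The function names and the 0.3 threshold are hypothetical choices, not anything prescribed.

```python
import random
from collections import Counter


def sample_urls(urls, k):
    """Randomly pick up to k URLs that share the same form."""
    return random.sample(urls, min(k, len(urls)))


def difference_ratio(signatures):
    """Fraction of sampled pages whose signature differs from the most common one."""
    if not signatures:
        return 0.0
    _, top_count = Counter(signatures).most_common(1)[0]
    return 1 - top_count / len(signatures)


def check_pattern(pattern, signatures, threshold=0.3):
    """Return an alert message if too many samples disagree, else None."""
    ratio = difference_ratio(signatures)
    if ratio > threshold:
        return f"{pattern}: {ratio:.0%} of samples differ; this URL form may use several templates"
    return None


quiet = check_pattern("/topic/*", ["a", "a", "a", "b"])
alert = check_pattern("/node/*", ["a", "b", "c", "a"])
```

With these made-up signatures, `quiet` stays `None` (one sample in four differs) while `alert` carries a message (half of the samples differ).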

That's the way I see it... I also wonder whether reducing the two HTML files to their barebone structures (tag names, maybe names, but definitely no content strings) would help at all. That might be the first step toward a comparison.
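To make the barebone-structure idea concrete, here is a sketch using only Python's standard library. `SkeletonParser`, `skeleton` and `structure_similarity` are hypothetical names, keeping the `id` attribute is my own assumption about what "names" should cover, and `difflib` is just one convenient way to score two tag sequences.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class SkeletonParser(HTMLParser):
    """Collects tag names (plus id values) and drops all text content."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        label = tag
        element_id = dict(attrs).get("id")
        if element_id:
            label += "#" + element_id
        self.tokens.append("<" + label + ">")

    def handle_endtag(self, tag):
        self.tokens.append("</" + tag + ">")


def skeleton(html):
    """Return the page as a list of structural tokens, without content strings."""
    parser = SkeletonParser()
    parser.feed(html)
    parser.close()
    return parser.tokens


def structure_similarity(html_a, html_b):
    """Similarity in [0, 1] between the two tag sequences."""
    return SequenceMatcher(None, skeleton(html_a), skeleton(html_b)).ratio()


page_a = "<html><body><div id='post'><p>Hello</p></div></body></html>"
page_b = "<html><body><div id='post'><p>Completely different text</p></div></body></html>"
same_structure = structure_similarity(page_a, page_b)
```

The two sample pages differ only in their text, so their skeletons are identical. On real pages the skeleton is still long, so this is a first step and does not fix the speed problem.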

In Topic: Nyc Subway System E/r Diagram

20 May 2012 - 01:45 AM

Ah. Yes. You are right. I made a mistake. I thought turnstile = turnstile data. In fact, I need another relation.
Thanks.

In Topic: Best Interface Framework For Python

20 May 2012 - 01:30 AM

Tkinter is good but I think it doesn't support tabs out of the box. I'll suggest wxPython because it has good documentation and support.


You can try Tix, an extension that adds more widgets on top of Tkinter.

In Topic: Building Web Page Clustering Application, Need Help

20 May 2012 - 01:25 AM

Comparing HTML tags is not going to help you, and comparing whole HTML pages is extremely slow. Just check out the page source of CodeCall (right here, check it out). The file is huge and contains many strings.

You need to set up some criteria and, based on them, decide whether two pages use the same template.
Remember that nowadays we don't make one big template. We "include" or "extend" other templates. You also can't remove all strings from the page source, because some of the content can provide useful information.

Why not analyze the links, the URLs? Make them into a graph and consider the weights and distances.
I mean, if you exclude certain HTML elements and the post content from your search (you know for certain that some pages contain dynamic information, such as a forum thread page) and just concentrate on links such as the URLs of button images, avatars, etc., then you can say the pages very likely come from the same group of templates.
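Here is a simplified sketch of the link-based comparison. It does not build a weighted graph; it swaps in a plain Jaccard overlap of the asset URLs found on each page, which is the easiest version to try first. `AssetCollector`, `asset_urls`, `jaccard` and the sample paths are all hypothetical.

```python
from html.parser import HTMLParser


class AssetCollector(HTMLParser):
    """Collects URLs of images, scripts and stylesheets; ignores everything else."""

    ASSET_ATTRS = {"img": "src", "script": "src", "link": "href"}

    def __init__(self):
        super().__init__()
        self.assets = set()

    def handle_starttag(self, tag, attrs):
        wanted = self.ASSET_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.assets.add(value)


def asset_urls(html):
    """Return the set of asset URLs referenced by the page."""
    collector = AssetCollector()
    collector.feed(html)
    collector.close()
    return collector.assets


def jaccard(a, b):
    """Overlap of two sets: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


page_a = '<link href="/css/forum.css"><img src="/img/reply.png"><img src="/img/avatar1.png">'
page_b = '<link href="/css/forum.css"><img src="/img/reply.png"><img src="/img/avatar2.png">'
overlap = jaccard(asset_urls(page_a), asset_urls(page_b))
```

The two sample pages share the stylesheet and the button image but not the avatar, so `overlap` comes out at 0.5. Per-user assets like avatars pull the score down, so you may want to filter them out or weight shared layout assets more heavily.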

Also, just because two pages produce 80% different content in their HTML page source does not mean they are not based on the same template.

I can have this
<html>
<head></head>
<body>
{% load data based on url %}    // RESTful API .....
</body>
</html>

From that one template I can generate a page of tables or a page of lists. They don't share the same structure.

So if you compare structure alone, you will definitely miss some of them.

In Topic: Nyc Subway System E/r Diagram

19 May 2012 - 06:39 PM

Hi Orjan,

Thanks. I've updated my E/R diagram
http://i.stack.imgur.com/8YzQ9.png

Suggestions 1 and 2 are not applicable because my application does not need them, so we can safely disregard those relations.


I think these are my relations.

Train(name)
Station(station_id, station_name, logitude, latitude)
TrainManager(train_name, station_id)

RemoteUnit(remote_unit_key)
StationRemoteManager(station_id, remote_unit_key)

ControlArea(ctrl_area_unit_id, ctrl_area_key)
ControlManager(remote_unit_key, ctrl_area_unit_key, ctrl_area_key)

Turnstile(scp_code, ctrl_area_unit_id, ctrl_area_key)


Should I generate a table for the red rhombus?

Thanks,



These are my SQL statements (MySQL):


-- MySQL parses but ignores inline column-level REFERENCES clauses,
-- so the foreign keys are declared as table-level FOREIGN KEY constraints.

CREATE TABLE Train (
    name VARCHAR(5) PRIMARY KEY
);

CREATE TABLE Station (
    station_id   INT PRIMARY KEY,
    station_name VARCHAR(60),
    logitude     DECIMAL(8,5),
    latitude     DECIMAL(8,5)
);

CREATE TABLE TrainManager (
    train_name VARCHAR(5),
    station_id INT,
    UNIQUE (station_id),
    FOREIGN KEY (train_name) REFERENCES Train(name),
    FOREIGN KEY (station_id) REFERENCES Station(station_id)
);

CREATE TABLE RemoteUnit (
    remote_unit_key VARCHAR(10) PRIMARY KEY
);

-- ControlArea must exist before the tables that reference it.
CREATE TABLE ControlArea (
    ctrl_area_unit_id INT UNIQUE,
    ctrl_area_key     VARCHAR(10) UNIQUE
);

CREATE TABLE StationControlManager (
    station_id        INT,
    ctrl_area_unit_id INT,
    UNIQUE (station_id, ctrl_area_unit_id),
    FOREIGN KEY (station_id) REFERENCES Station(station_id),
    FOREIGN KEY (ctrl_area_unit_id) REFERENCES ControlArea(ctrl_area_unit_id)
);

CREATE TABLE ControlManager (
    remote_unit_key   VARCHAR(10),
    ctrl_area_unit_id INT,
    ctrl_area_key     VARCHAR(10),
    UNIQUE (remote_unit_key, ctrl_area_unit_id, ctrl_area_key),
    FOREIGN KEY (remote_unit_key) REFERENCES RemoteUnit(remote_unit_key),
    FOREIGN KEY (ctrl_area_unit_id) REFERENCES ControlArea(ctrl_area_unit_id),
    FOREIGN KEY (ctrl_area_key) REFERENCES ControlArea(ctrl_area_key)
);

CREATE TABLE Turnstile (
    scp_code          VARCHAR(20),
    ctrl_area_unit_id INT,
    ctrl_area_key     VARCHAR(10),
    date              DATE,
    time              TIME,
    descn             BIT(1),
    entries_n         BIGINT UNSIGNED NOT NULL,
    exists_n          BIGINT UNSIGNED NOT NULL,
    UNIQUE (ctrl_area_unit_id, scp_code, date, time),
    FOREIGN KEY (ctrl_area_unit_id) REFERENCES ControlArea(ctrl_area_unit_id),
    FOREIGN KEY (ctrl_area_key) REFERENCES ControlArea(ctrl_area_key)
);
