
jwxie518's Content

There have been 16 items by jwxie518 (Search limited from 01-December 19)



#630138 Building Web Page Clustering Application, Need Help

Posted by jwxie518 on 20 May 2012 - 10:43 PM in Python

You said you might end up going to two threads (one with 1 reply, one with 30 replies). If the link is of the form `http://codecall.net/topic/xxxxxxxxx` then you should just ignore the rest. Just analyze one?

This leads to my question: what's the purpose of this application? From a human perspective, it is very unlikely that `http://codecall.net/forum/xxxxxx` uses the same template as `http://codecall.net/topic/xxxx`. So, as the author of the machine, the first time you encounter a URL of the form `/topic/xxxx`, put it in the queue to analyze. Learn it. Then, if your machine finds another path of exactly that form, don't put it in the analyzer queue; instead, just mark it "walk-through only", that is, only collect links from it.
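Here is a minimal sketch of that queueing idea in Python. The names `url_shape` and `route` are made up for illustration, and it assumes the first path segment (`topic`, `forum`, ...) is what identifies the page type, which may not hold for every site.

```python
from urllib.parse import urlparse


def url_shape(url):
    """Reduce a URL to a coarse path pattern, e.g. /topic/630138 -> /topic/*."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if not parts:
        return "/"
    # Assumption: the first segment names the page type, the rest varies per page.
    return "/" + parts[0] + ("/*" if len(parts) > 1 else "")


def route(urls):
    """Send the first URL of each shape to the analyzer; the rest are walk-through only."""
    seen_shapes = set()
    analyze, walk_only = [], []
    for url in urls:
        shape = url_shape(url)
        if shape in seen_shapes:
            walk_only.append(url)  # known shape: only collect links from it
        else:
            seen_shapes.add(shape)
            analyze.append(url)    # new shape: learn its template
    return analyze, walk_only
```

With this, a second `/topic/...` URL ends up in `walk_only` while the first `/forum/...` URL still gets analyzed.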

Now, if you have worked with Drupal-based sites, a lot of their content is actually based on some global templates (like a Page template), even though the URLs may look different. That's the part you need to analyze. If the links are of the same format, you should assume they use the same template. Or you can just pick some of them at random and analyze those. If your machine learns that the ratio of differences is high, alert the learning state and say "hey bro, this URL may actually use different templates based on some kind of preference: maybe by GET, POST, PUT, DELETE, or by language, by JavaScript / non-JavaScript, etc."

That's the way I see it... I also wonder whether compressing the two HTML files down to their barebone structures would help at all (keep names and tags, maybe, but definitely not any content strings). That might be the first step toward a comparison.
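A rough sketch of that "barebone structure" comparison, using only the Python standard library. The class and function names are hypothetical, and keeping `id`/`class` values in the skeleton is my own assumption about what "names" should cover.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class SkeletonParser(HTMLParser):
    """Record tag names (plus id/class values) and ignore all text content."""

    def __init__(self):
        super().__init__()
        self.skeleton = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        label = tag
        if attrs.get("id"):
            label += "#" + attrs["id"]
        if attrs.get("class"):
            label += "." + attrs["class"]
        self.skeleton.append(label)


def skeleton(html):
    parser = SkeletonParser()
    parser.feed(html)
    return parser.skeleton


def structure_similarity(html_a, html_b):
    """Ratio in [0, 1] of how similar the two tag sequences are."""
    return SequenceMatcher(None, skeleton(html_a), skeleton(html_b)).ratio()
```

Two pages with the same tags but different text score 1.0 here; how low a score has to be before you call it "a different template" is something you would have to tune.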



#630092 Nyc Subway System E/r Diagram

Posted by jwxie518 on 20 May 2012 - 01:45 AM in Databases

Ah. Yes. You are right. I made a mistake. I thought turnstile = turnstile data. In fact, I need another relation.
Thanks.



#630091 Best Interface Framework For Python

Posted by jwxie518 on 20 May 2012 - 01:30 AM in Python

Tkinter is good, but I think plain Tkinter doesn't support tabs out of the box (the ttk module's Notebook widget does provide them). I'd suggest wxPython because it has good documentation and support.


You can also try Tix, which is an extension layer on top of Tkinter.



#630090 Building Web Page Clustering Application, Need Help

Posted by jwxie518 on 20 May 2012 - 01:25 AM in Python

Comparing HTML tags is not going to help you. Comparing whole HTML pages is extremely slow. Just check out the page source of CodeCall (right here, check it out). The file is huge and contains many strings.

You need to set up some criteria. Based on them, you decide whether two pages are using the same template.
Remember, nowadays we don't make one big template. We "include" or "extend" some templates. You also can't remove all strings from the page source, because some content can provide useful information.

Why not analyze the links, the URLs? Make them into some graphs, and consider the weight and the distance.
I mean, if you exclude certain HTML elements and the post contents from your search (you know for certain that some pages contain dynamic information, such as a forum thread page), and just concentrate on links such as the URLs of button images, avatars, etc., then you can say the pages very likely come from the same group of templates.
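As a simplified stand-in for the graph idea (this is plain set overlap, not weights and distances), here is a sketch that collects asset URLs from two pages and computes their Jaccard similarity. The names `AssetCollector`, `assets`, and `asset_overlap` are made up, and treating `img`, `script`, and `link` as the relevant tags is an assumption.

```python
from html.parser import HTMLParser


class AssetCollector(HTMLParser):
    """Collect image, script, and stylesheet URLs; ignore text and <a> links."""

    def __init__(self):
        super().__init__()
        self.assets = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.assets.add(attrs["src"])
        elif tag == "link" and attrs.get("href"):
            self.assets.add(attrs["href"])


def assets(html):
    collector = AssetCollector()
    collector.feed(html)
    return collector.assets


def asset_overlap(html_a, html_b):
    """Jaccard similarity of the two asset sets: |A & B| / |A | B|."""
    a, b = assets(html_a), assets(html_b)
    if not a and not b:
        return 0.0  # nothing to compare, so claim no evidence of a shared template
    return len(a & b) / len(a | b)
```

Pages that share the same stylesheet and button images score high even when their post contents are completely different.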

Also, just because the HTML source of two pages differs by 80% does not mean they are not based on the same template.

I can have this:
<html>
<head></head>
<body>
{% load data based on url %}    // RESTful API .....
</body>
</html>

From that I can generate a page of tables or a page of lists. They don't show the same structure.

So you will definitely miss some of them.



#630086 Nyc Subway System E/r Diagram

Posted by jwxie518 on 19 May 2012 - 06:39 PM in Databases

Hi Orjan,

Thanks. I've updated my E/R diagram
http://i.stack.imgur.com/8YzQ9.png

Advice 1 and 2 are not applicable because my application does not need them. So we can safely disregard those relations.


I think these are my relations.

Train(name)
Station(station_id, station_name, longitude, latitude)
TrainManager(train_name, station_id)

RemoteUnit(remote_unit_key)
StationRemoteManager(station_id, remote_unit_key)

ControlArea(ctrl_area_unit_id, ctrl_area_key)
ControlManager(remote_unit_key, ctrl_area_unit_id, ctrl_area_key)

Turnstile(scp_code, ctrl_area_unit_id, ctrl_area_key)


Should I generate a table for the red rhombus?

Thanks,



These are my SQL statements (MySQL):


CREATE TABLE Train (
    name    VARCHAR(5) PRIMARY KEY
);



CREATE TABLE Station(
    station_id    INT PRIMARY KEY,
    station_name   VARCHAR(60),
    longitude    DECIMAL(8,5),
    latitude     DECIMAL(8,5)
);


CREATE TABLE TrainManager(
    train_name    VARCHAR(5) REFERENCES Train(name),
    station_id	  INT REFERENCES Station(station_id),
    UNIQUE (station_id)
);


CREATE TABLE RemoteUnit(
    remote_unit_key VARCHAR(10) PRIMARY KEY
);




CREATE TABLE ControlArea(
    ctrl_area_unit_id    INT UNIQUE,
    ctrl_area_key        VARCHAR(10) UNIQUE
);


-- Created after ControlArea because it references that table.
CREATE TABLE StationControlManager(
    station_id INT REFERENCES Station(station_id),
    ctrl_area_unit_id INT REFERENCES ControlArea(ctrl_area_unit_id),
    UNIQUE(station_id, ctrl_area_unit_id)
);


CREATE TABLE ControlManager(
    remote_unit_key    VARCHAR(10) REFERENCES RemoteUnit(remote_unit_key),
    ctrl_area_unit_id    INT REFERENCES ControlArea(ctrl_area_unit_id),
    ctrl_area_key	  VARCHAR(10) REFERENCES ControlArea(ctrl_area_key),
    UNIQUE (remote_unit_key, ctrl_area_unit_id, ctrl_area_key)
);

CREATE TABLE Turnstile(
    scp_code    VARCHAR(20),
    ctrl_area_unit_id    INT REFERENCES ControlArea(ctrl_area_unit_id),
    ctrl_area_key	  VARCHAR(10) REFERENCES ControlArea(ctrl_area_key),
    date DATE,
    time TIME,
    descn BIT(1),
    entries_n BIGINT UNSIGNED NOT NULL,
    exits_n BIGINT UNSIGNED NOT NULL,
    UNIQUE (ctrl_area_unit_id, scp_code, date, time)
);



#629783 Nyc Subway System E/r Diagram

Posted by jwxie518 on 14 May 2012 - 09:15 PM in Databases

The following are the schemas provided; after each one I give a link to the sample data. The objective is to design a good database that clearly shows the relationships among the different entity sets.
(1)


Station(StationId, StationName, Line, Division, Latitude, Longitude)
http://goo.gl/EiauX

(2)


ControlRemote(ControlAreaUnitId, ControlArea, RemoteUnit, StationId, LineName)
http://goo.gl/Kyxph

(3)


ControlSCP(ControlAreaUnitId, SCP)
Relationship b/w control areas and turnstiles: http://goo.gl/QBzDg

Basic outline:

  • the system consists of multiple lines (1/2/3, A/B/C, etc) and multiple stations

  • Subway station usage data is collected by one or more control areas (each uniquely identified by a ControlAreaUnitId (artificial key) or a ControlArea key)

  • A group of control areas (one or more) is managed by a remote unit (uniquely identified by a RemoteUnit key)

  • Big stations MAY have multiple remote units

  • A group of small stations MAY share one remote unit

  • A control area collects data from one or more turnstiles

  • A turnstile is identified by SCP code within the control area
Here is my diagram: [image]
The red supporting relation is actually a weak one (I can't find a symbol for it in my drawing tool). By SCP Data I mean Turnstile.
What do you guys think?

Question

1. The original schema ControlRemote has StationId. Neither Remote nor ControlArea has a StationId key in it, so I guess the best way to link them is to set up a supporting relation StationManager, so we can identify which station a remote unit (and also a control area) belongs to.

2. When we convert them into SQL statements, we don't need to generate a table for the red relation, right? With a weak entity, the relation in red is redundant (many-to-one).
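To make question 2 concrete, here is a minimal sketch using Python's sqlite3 (not the MySQL from this thread, and trimmed to two tables). It assumes Turnstile is the weak entity owned by ControlArea; the SCP values are made-up sample data and `try_insert` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
CREATE TABLE ControlArea (
    ctrl_area_unit_id INTEGER PRIMARY KEY
);
-- Weak entity: scp_code is unique only within its control area, so the
-- owner's key is part of the primary key. The supporting (red) relation
-- gets no table of its own; the foreign key column already records it.
CREATE TABLE Turnstile (
    ctrl_area_unit_id INTEGER NOT NULL REFERENCES ControlArea(ctrl_area_unit_id),
    scp_code          TEXT NOT NULL,
    PRIMARY KEY (ctrl_area_unit_id, scp_code)
);
""")
conn.execute("INSERT INTO ControlArea VALUES (1), (2)")
conn.execute("INSERT INTO Turnstile VALUES (1, '00-00-00')")
conn.execute("INSERT INTO Turnstile VALUES (2, '00-00-00')")  # same SCP, other area


def try_insert(area_id, scp):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO Turnstile VALUES (?, ?)", (area_id, scp))
        return True
    except sqlite3.IntegrityError:
        return False
```

The same SCP code is accepted in two different control areas, while a duplicate within one area, or a turnstile pointing at a control area that does not exist, is rejected.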

These are the relations I think we need to put into DB


Train(name)
Station(station_id, station_name, longitude, latitude)
TrainManager(train_name, station_id)
RemoteUnit(remote_unit_key)
StationRemoteManager(station_id, remote_unit_key)
ControlArea(ctrl_area_unit_id, ctrl_area_key)
ControlManager(remote_unit_key, ctrl_area_unit_id, ctrl_area_key)
Turnstile(scp_code, ctrl_area_unit_id, ctrl_area_key)


3. Do you guys think this is a good design? I feel like having so many extra supporting relations seems unnecessary.



Thanks.



#610306 How to look into memory?

Posted by jwxie518 on 30 September 2011 - 06:23 PM in General Programming

Professor gave us a challenge. He said the following (LOL fine.. paraphrased)

Suppose you have the following C++ statement:
c = a + b;
I don't care what u put for a and b.
a = 10, b = 200; I don't care.

I want you guys to be able to tell me the address of c, so that in essence you can find its value before and after a+b is evaluated.
I don't want you to use cout. No printing on the screen. NO. This means no reference, no pointer.


Now the problem is: I have never done a memory dump. I am on Linux, by the way.
I was reading about hex dumps (Hex dump - Wikipedia, the free encyclopedia).

Can anyone think of a good approach?

Thanks.



#606564 Javascript IDE

Posted by jwxie518 on 06 August 2011 - 12:35 PM in General Programming

I have tried many JavaScript IDEs.
Aptana seems to be the best, yet I don't understand how to use its debugger. I already installed Firebug, and I can see the wrapper.

Say I want to run

var hellostring = "string";

I want to evaluate hellostring and get "string" as output.

I can do this using the built-in debugger in the browser (for example, in Chrome).
Is there a way to do the same in the IDE?

Do I always have to wrap my code in HTML? I am getting frustrated with how to develop JavaScript projects. I understand that JS needs HTML...

Thanks



#604187 bandwidth

Posted by jwxie518 on 03 July 2011 - 12:37 PM in The Lounge

Actually, some ISPs intentionally block P2P. But there is always a way to get around that block. In general, people will figure out a way to get past it.
I have never tried FiOS, but it is supposed to be the fastest service. Then there is DOCSIS 3, which is very expensive unless you have a rich cable provider anyway.



#604185 How do you run Hello World in C++? [Expert Needed]

Posted by jwxie518 on 03 July 2011 - 12:28 PM in The Lounge

This reminds me of obfuscated code contests particularly the one requiring to write "self reproducing program"

Quine (computing) - Wikipedia, the free encyclopedia

Wow, that's interesting! When I first saw Quine I thought you were talking about the Quine–McCluskey algorithm. :]



#604184 How do you run Hello World in C++? [Expert Needed]

Posted by jwxie518 on 03 July 2011 - 12:27 PM in The Lounge

In theory it should work. A BMP file is, literally, a map of bits (with some meta information at the beginning). So, if you convert the C code from ASCII to hex and break it up into 24-bit chunks, you will end up with a list of what could be used as hex colors.

Yes. I actually looked at your recent blog. A success :]
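For the curious, the chunking step from the quote can be sketched in a few lines of Python. `source_to_colors` is a made-up name, and zero-padding the last chunk is my assumption, since the quote doesn't say what to do with leftover bytes. It only produces the color list; it doesn't write a real BMP (which would also need a header, row padding, and BGR byte order).

```python
def source_to_colors(source):
    """Split ASCII source text into 24-bit chunks, one hex color per chunk."""
    data = source.encode("ascii")
    # Assumption: pad with zero bytes so the length is a multiple of 3.
    data += b"\x00" * (-len(data) % 3)
    return ["#" + data[i:i + 3].hex() for i in range(0, len(data), 3)]


print(source_to_colors("int main"))
# ['#696e74', '#206d61', '#696e00']
```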



#604084 How do you run Hello World in C++? [Expert Needed]

Posted by jwxie518 on 01 July 2011 - 09:05 AM in The Lounge

I doubt it. Maybe it was just photoshopped. When I have time I should try this out. Ha.



#604047 How do you run Hello World in C++? [Expert Needed]

Posted by jwxie518 on 30 June 2011 - 06:21 PM in The Lounge

The best solution is here

http://i.imgur.com/QlGpd.gif



#604046 How do you run Hello World in C++? [Expert Needed]

Posted by jwxie518 on 30 June 2011 - 06:20 PM in The Lounge

[image]


I don't understand.


........

WHAT?


Go read this guy
visual c++ - Why is this program erroneously rejected by three C++ compilers? - Stack Overflow



#604045 bandwidth

Posted by jwxie518 on 30 June 2011 - 06:17 PM in The Lounge

Thank you to all three of you. Now I understand.

Not really. One cable for 3 floors, 6 computers. Not cool -_-
I am still waiting for FiOS.



#603950 bandwidth

Posted by jwxie518 on 29 June 2011 - 08:27 AM in The Lounge

To this day I still don't understand what bandwidth means for an Internet connection. I know that by definition it is the amount of data that can be transmitted through the channel during a specific period of time.

Faster services usually come with higher bandwidth. As a cable user in New York City, I see that the company oversold (thanks to its monopoly), so hundreds, if not thousands, of users are usually assigned to the same pool.

Does higher bandwidth really help? The dilemma is that after downloading a couple of GB, or even just a few MB, the Internet speed becomes slower. What usually happens is that after using the Internet for a few days, it is as slow as a snail for the next few days (sometimes up to a week). Is that due to a bandwidth cap? Is there even such a thing as a "bandwidth cap" today?

Does more bandwidth help?
15Mbps down / 2Mbps up

This doesn't look good because upstream is only 2/15 (roughly 1/7) of the downstream.
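To put numbers on the definition, here is a back-of-the-envelope sketch in Python. It assumes 1 GB = 1000 MB and an ideal link with no protocol overhead, shared pool, or throttling, so real transfers will be slower; `transfer_seconds` is a made-up helper name.

```python
def transfer_seconds(size_megabytes, rate_mbps):
    """Ideal transfer time: megabytes * 8 bits per byte / megabits per second."""
    return size_megabytes * 8 / rate_mbps


DOWN_MBPS = 15
UP_MBPS = 2

download_s = transfer_seconds(1000, DOWN_MBPS)  # 1 GB down
upload_s = transfer_seconds(1000, UP_MBPS)      # 1 GB up

print(f"download: {download_s / 60:.1f} min")          # download: 8.9 min
print(f"upload:   {upload_s / 60:.1f} min")            # upload:   66.7 min
print(f"up/down ratio: {UP_MBPS / DOWN_MBPS:.3f}")     # up/down ratio: 0.133
```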



