I'm writing a new RDBMS engine - who wants to join

11 replies to this topic

#1
JCoder

    Programming Professional

  • Members
  • 245 posts
Hello,

I'm writing a new RDBMS engine (just for the fun of writing it). If anyone is interested, we could make it a new open-source project, just like MySQL or PostgreSQL.

The unique features of my project are:
- autonomic database tuning and adaptation to workload
- autonomic memory configuration
- strong query planner, much better than that of other open-source engines
- native replication and HA
- true transaction serializability

First, I want to make it performance-oriented and easy to use, and NOT feature-rich (so don't expect support for zillions of types, spatial indexes or stored procedure programming languages).

I have already written some parts of it in Java (SQL parser and the query planner). Does anyone want to join?

#2
debtboy

    Programming God

  • Members
  • 916 posts
Great Project...

I'm of no help writing,
but give me a shout when you need a tester. ;)

#3
TkTech

    The Crazy One

  • Moderators
  • 1,396 posts

Quote

I have already written some parts of it in Java (SQL parser and the query planner).

Enough said.

#4
JCoder

    Programming Professional

  • Members
  • 245 posts

Quote

Quote

I have already written some parts of it in Java (SQL parser and the query planner).

Enough said.

Could you elaborate on this? If you were trying to be ironic, you are badly mistaken - Java is a perfect language to write an RDBMS in, much better than C or C++. A naive nested loops join hand-optimized in assembly will always be much slower than a buffered nested loops join with hashing coded in Java. It is the algorithms that make an RDBMS fast, not the language it is implemented in.
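
To make the algorithmic point concrete, here is a minimal sketch of the two join strategies mentioned above, run on plain integer keys. The class name, method names and the `blockSize` parameter are made up for illustration and are not taken from any real engine. The naive version does work proportional to |outer| x |inner|, while the buffered version scans the inner input once per outer block and probes a hash table instead of comparing every pair.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class JoinSketch {
    // Naive nested loops join: compares every outer key with every inner key.
    public static List<int[]> naiveJoin(int[] outer, int[] inner) {
        List<int[]> out = new ArrayList<>();
        for (int o : outer) {
            for (int i : inner) {
                if (o == i) {
                    out.add(new int[] {o, i});
                }
            }
        }
        return out;
    }

    // Buffered (block) nested loops join with hashing: the outer input is read
    // in blocks of blockSize keys, each block is hashed, and the inner input is
    // scanned once per block, probing the hash table for matches.
    public static List<int[]> blockHashJoin(int[] outer, int[] inner, int blockSize) {
        List<int[]> out = new ArrayList<>();
        for (int start = 0; start < outer.length; start += blockSize) {
            int end = Math.min(start + blockSize, outer.length);
            // key -> number of times it occurs in the current outer block
            Map<Integer, Integer> counts = new HashMap<>();
            for (int k = start; k < end; k++) {
                counts.merge(outer[k], 1, Integer::sum);
            }
            for (int i : inner) {
                int n = counts.getOrDefault(i, 0);
                for (int c = 0; c < n; c++) {
                    out.add(new int[] {i, i});
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] outer = {1, 2, 2, 5};
        int[] inner = {2, 3, 5, 5};
        // Both strategies must produce the same number of matching pairs.
        System.out.println(naiveJoin(outer, inner).size());
        System.out.println(blockHashJoin(outer, inner, 2).size());
    }
}
```

The two methods return the same set of pairs (possibly in a different order); only the amount of comparison work differs.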

And the query planner I wrote for a research project is much better than PostgreSQL's* - it is slower, but it produces much better plans (e.g. it considers bushy join trees, rewrites correlated subselects into joins, and leverages the LIMIT clause to favour pipelining).


*) It is quite easy to be better - the PostgreSQL planner is really dumb - e.g. it cannot estimate the selectivity of such simple queries as: SELECT * FROM table WHERE a < b. Or when you add a LIMIT 10 clause, it proposes to compute everything and then output only the first 10 rows. Or it cannot use index-only scans (OK, in fact this is a problem of the poor MVCC implementation, not of the planner - indexes have no version info, so every index scan has to fetch rows from the indexed tables). MySQL is even dumber (ancient rule-based planner).
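
As a rough illustration of the LIMIT point, the sketch below counts how many rows each kind of plan has to scan. Everything in it is hypothetical (the class and method names, and the "even values match" predicate standing in for a WHERE clause), and it models no real planner. It only shows why a pipelined plan can stop early while a materializing one cannot.

```java
import java.util.ArrayList;
import java.util.List;

public class LimitSketch {
    // Pipelined plan: rows flow through the filter one at a time and the scan
    // stops as soon as `limit` matching rows have been emitted.
    // Returns {rows scanned, rows returned}.
    public static int[] pipelined(int[] table, int limit) {
        int scanned = 0;
        int emitted = 0;
        for (int row : table) {
            if (emitted == limit) {
                break;
            }
            scanned++;
            if (row % 2 == 0) { // hypothetical predicate: even values match
                emitted++;
            }
        }
        return new int[] {scanned, emitted};
    }

    // Materializing plan: the whole filtered result is computed first and only
    // then truncated to `limit` rows, so every row of the table is scanned.
    // Returns {rows scanned, rows returned}.
    public static int[] materialized(int[] table, int limit) {
        int scanned = 0;
        List<Integer> matches = new ArrayList<>();
        for (int row : table) {
            scanned++;
            if (row % 2 == 0) {
                matches.add(row);
            }
        }
        int returned = Math.min(limit, matches.size());
        return new int[] {scanned, returned};
    }

    public static void main(String[] args) {
        int[] table = new int[1000];
        for (int i = 0; i < table.length; i++) {
            table[i] = i;
        }
        int[] p = pipelined(table, 10);
        int[] m = materialized(table, 10);
        System.out.println("pipelined scanned " + p[0] + " rows, returned " + p[1]);
        System.out.println("materialized scanned " + m[0] + " rows, returned " + m[1]);
    }
}
```

Both plans return the same number of rows; the difference is only in how much of the table is read to get them.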

#5
TkTech

    The Crazy One

  • Moderators
  • 1,396 posts
Haha, Java is absolutely not the language to use to write a performance RDBMS. It isn't the language to write a performance anything.

A. Virtual Machine. Right there, you've hit an unstoppable performance block.
B. Big Endian. Everything in Java is big endian. Working with the OS, reading configuration files, saving SQL dumps, etc. will all require flipping the bytes every time you read or write them.
C. Why would you write your loop in assembly? That's back in the day when C compilers were dumber than humans. GCC or VC usually knows better than you do.
D. Loops in Java will never, ever, ever be faster than the equivalent loop in C. If you benchmarked this and somehow Java came out on top, you need to re-learn C.

Java is OK for certain things. And now that it's open source and getting active community work, it'll just keep getting better. But producing a performance-critical application is not in its job description. A stability-critical application, maybe.

And don't even go the portability route. It's very easy to make your C/C++ program portable.

#6
JCoder

    Programming Professional

  • Members
  • 245 posts
Some benchmarks show you are totally wrong.
The H2 Java RDBMS IS faster than both MySQL and PostgreSQL for most of the queries tested in open-source benchmarks. This does not prove Java is faster than C, but it proves it is possible to write a system that is just as fast or even faster in Java, in a much shorter time (H2 was written by one guy in 2 years). Another comparison - Tomcat vs the Apache web server (Tomcat is slightly faster - check for yourself if you don't believe me).

It seems you don't have the slightest idea how database systems work.
Big-endianness is not a performance issue - disks are orders of magnitude slower than memory - so flipping these bytes is unnoticeable. Just try reading a large binary file in Java and in C - the performance is exactly the same.
And having a consistent binary format can save you a lot of trouble when moving a database from one server to another (software in C or C++ cannot do this efficiently if the architectures differ).
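
A minimal sketch of what a fixed on-disk byte order looks like in Java, using the standard `java.nio.ByteBuffer` and `ByteOrder` classes. The class and method names (`EndianSketch`, `encode`, `decode`) are invented for this example. The byte order is spelled out explicitly here, although big-endian is also `ByteBuffer`'s default, so the encoded bytes are the same on every platform regardless of the CPU's native order.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianSketch {
    // Encodes an int as 4 bytes in big-endian order, independent of the host CPU.
    public static byte[] encode(int value) {
        return ByteBuffer.allocate(Integer.BYTES)
                .order(ByteOrder.BIG_ENDIAN)
                .putInt(value)
                .array();
    }

    // Decodes 4 big-endian bytes back into an int, independent of the host CPU.
    public static int decode(byte[] bytes) {
        return ByteBuffer.wrap(bytes)
                .order(ByteOrder.BIG_ENDIAN)
                .getInt();
    }

    public static void main(String[] args) {
        byte[] bytes = encode(0x01020304);
        // The most significant byte always comes first in the encoded form.
        System.out.println("first byte: " + bytes[0] + ", last byte: " + bytes[3]);
        System.out.println("round trip ok: " + (decode(bytes) == 0x01020304));
        // The native order differs between machines; the file format does not.
        System.out.println("native order of this machine: " + ByteOrder.nativeOrder());
    }
}
```

A file written this way on one architecture can be read back unchanged on another, which is the portability property described above.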

On the contrary, stability is ALSO a performance issue. How fast is an RDBMS that has memory leaks?

Java also HAS some big performance advantages over C and C++ when it comes to multithreading and locking (also very important in real RDBMSes). Can your optimizing C++ compiler do biased locking or lock coarsening optimizations?

However, the biggest issue is the productivity you gain by writing software in high-level languages - in the same amount of time you can write a better query planner and more join algorithms or index type implementations than in low/middle-level languages like C or C++. Algorithms are the most important thing affecting the performance of an RDBMS. Both MySQL (C++) and PostgreSQL (C) have algorithms from the previous decade and are developed very slowly.

#7
TkTech

    The Crazy One

  • Moderators
  • 1,396 posts
Well, before this turns into a Java VS Misc war, I'm going to be an ass.

You are wrong, and you will find it out the hard way. Enjoy. In 10 years if you prove me wrong, I'd be glad to hear about it.

#8
JCoder

    Programming Professional

  • Members
  • 245 posts
I do not have to prove anything, because there are already RDBMSes written in Java that are faster than those written in C or C++, at least when it comes to open-source ones - so it has already been proved.

When we switched JBoss JMS from HSQLDB to PostgreSQL we got about a 10 times slowdown, even though the strict durability of transactions in PostgreSQL (fsync) was switched off, so nothing prevented it from caching all reads and writes in memory. Both configurations were fully transactional and were writing data to disk.

BTW: Java is compiled to native code (by the JIT compiler) and in strictly numeric benchmarks has very similar performance to C/C++. In the most pessimistic cases it loses by about 50%, but in the most optimistic ones it wins by about the same margin (especially when it comes to creating lots of small objects on the heap - C++ is a terrible loser there). Heap memory consumption is also very similar, though it is hard to measure (benchmark measurements on the Internet are especially unfair - they include permgen and JVM memory in the reported memory consumption for Java, but skip the memory taken by the OS and dynamically linked libraries for C++ programs).

#9
debtboy

    Programming God

  • Members
  • 916 posts

JCoder said:

I do not have to prove anything, because there are already RDBMSes written in Java

Do you have a link?
Maybe I can be a tester for those systems. :rolleyes:
:lol:

I think TkTech is a pretty sharp guy and you seem to be also,
so I don't understand why either of you is bothering with this
discussion. For that matter, I have better things to do too.
Good luck with your project, JCoder.

#10
JCoder

    Programming Professional

  • Members
  • 245 posts
I cannot give links, because the forum software doesn't allow me to.
But type "H2 database" and "HSQLDB database" into Google and click the first result.

It is astonishing that a single guy could do something like H2 in a little over 2 years. HSQLDB has great performance if the whole database fits in memory. And if it does not, raw computational speed (CPU cycles) hardly matters - large databases are all about doing I/O efficiently. This can be done in almost any programming language. I would risk saying that it is possible to write a fast RDBMS in Python or Ruby (but I don't like using dynamically typed languages for such large and complex projects).

TkTech still hasn't given any technical arguments to support his point of view (except the "endianness problem", which is rather a great feature* and not a performance problem at all), so yes - this off-topic performance discussion is pointless.


*) Migrating a database is as easy as copying the database folder to another machine, forgetting that the source machine was Windows on x86 and the target one is Solaris / UltraSPARC or 64-bit Linux on AMD Athlon... No C/C++ database system can do this. ;)

#11
WingedPanther

    A spammer's worst nightmare

  • Moderators
  • 16,831 posts
There's an important issue before this gets out of hand: Are you comparing apples to oranges? It's not enough to say H2 (which is nice) is faster than MySQL. H2 is designed to have a single client connected to the database at a time; MySQL supports hundreds or thousands of connections. What features does H2 support? Does it have triggers? How many data types does it support? What built-in functions does it support? etc. etc. etc.

To fully compare two databases, you need to compare more than just transaction speed. Scalability and features are also important. I can create a C++ program that will out-perform Fortran in performing complex math... as long as the only math you want to do is add 1+1.

Also, given that H2 and HSQLDB already exist, what do you plan to do that would be different from them? Why not contribute to those projects?

#12
JCoder

    Programming Professional

  • Members
  • 245 posts
See the first post.

The priorities are much different for my project than for H2 and HSQLDB.
HSQLDB is designed as a lightweight embedded in-memory database system.
H2 is trying to be a feature-rich system, just like PostgreSQL or MySQL. And it is not true that it cannot handle thousands of connections. It can when started in server mode. However, I cannot say which one scales better. I only know MySQL scales quite poorly due to its table-level locking and unstable replication utilities; H2 is probably similar (also table-level locking, replication only in master-slave mode).

Mine is going to be a scalable, autonomic RDBMS, which means it will adapt to the workload, automatically create indexes or partition data, and tune memory settings. This affects its architecture, so directly joining H2 is not really possible, but it is probably possible to take some parts of the code (the license permits it).

Regarding the apples-to-oranges comparison: all the features you mention can be implemented without affecting performance. Triggers do not incur any overhead when one doesn't use them in the benchmark. The same applies to the various data types and built-in functions.
Even ACID support can be implemented in such a way that, when you don't need it, it does not slow anything down (unfortunately this is not the case with PostgreSQL). So this is not an excuse for lost benchmarks.