We need good open-source language tools

helgihg · Post by **helgihg** » Sun Apr 22, 2007 3:51 pm

Hi.

This is going to be a long one. First, this has nothing to do with money. I'm not offering a job.

If I were, this would be in the job-section but it's in the language-section for a reason.

Last year, when I had been in Finland for about a month, I started working on a tool to help me analyze Finnish. I eventually got a job, so I didn't have time to continue it, and in a frantic episode of stupidity I managed to lose the code. However, I want to start it again because I'm quite certain that it's not difficult at all.

Here are the technical basics.

* Name *
I like the name "Lingbase", short for "Linguistics Base", but I'm open to suggestions. The components will be called, for example, "Lingbase Dictionary", "Lingbase Analyzer" and so forth, and the whole project will be called Lingbase unless someone has a better idea.

* PHP *
It will be PHP. Interested developers, please, let's not argue about this one.

PHP is a very easy language: it's easy to learn and easy to maintain. Virtually any kind of programmer can write decent PHP, and that's my primary reason for selecting it. Furthermore, and perhaps more importantly, the PHP license is perfect for open-source development (whereas Java, for example, is not, at least not yet). If you can do Java, C or C++, you can do PHP.

* MySQL *
I don't really care which database is used, but I still think MySQL or PostgreSQL are good choices, since PHP programmers almost invariably know one or both. I'll start with MySQL; switching to PostgreSQL later on shouldn't really be a problem, and likely won't even be necessary.

* Linux *
Of course it will run on Linux because Linux 0wnz u noob. Of course, a programmer can use whatever operating system they like, but the software itself will be designed to run on Linux specifically (which means that virtually any Unix-clone will work just as well).

And now I'll rant a bit about what exactly I have in mind.

* The trick *
The "trick" to this piece of software will be what I call user-development. A good example is MediaWiki (which Wikipedia runs on), where the user community takes care of maintaining the information itself. Developers only have to know a few words of Finnish because the software will be community-based. In fact, the idea is to help people learn Finnish, so people who are actually in the process of learning Finnish are perfect for this.

Actually, the software will be a few components that will work together.

* Dictionary *
This is a bit more complex than it sounds. This is the underlying component on which all other components depend. The problem here is that all forms of each base word must be included, each marked with exactly what kind of inflection ("bending") it is. For example, "asua" (to live, as in "I live in Finland") is a base, the word you find in any dictionary. "Asun", "asut", "asuvat" and so forth are inflections of this word, and those must be included too. The exact inflection must also be registered: for example, "asun" is 'first-person singular' and "asumme" is 'first-person plural'. What makes things even more complex is that the definitions of these inflections must also be configurable, because the idea is to support multiple languages from the beginning, and properties of languages are notoriously variable.

Example.

Let's say that we want to add the word "asua" to our dictionary. We'll begin with the native form, which is just "asua" (although the so-called stem is presumably "asu").

There will be switches (configurable by language) that a user marks. Typical switches are "person:first", "count:plural" and so forth. I'm not sure how to explain this properly; I hope it's clear already.
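
To make the switch idea a bit more concrete, here's a rough sketch of how an entry could be stored. It's in Python purely for illustration (the project itself would be PHP), and every name in it (`FINNISH_SWITCHES`, `Entry`, `add_form`) is made up for this example, not a decided design.

```python
from dataclasses import dataclass, field

# Hypothetical per-language switch definitions: switch name -> allowed values.
# Another language would simply supply a different table.
FINNISH_SWITCHES = {
    "person": {"first", "second", "third"},
    "count": {"singular", "plural"},
}


@dataclass
class Entry:
    base: str        # the form you find in any dictionary, e.g. "asua"
    word_type: str   # e.g. "verb"
    forms: dict = field(default_factory=dict)  # inflected form -> switches

    def add_form(self, form, switches, allowed=FINNISH_SWITCHES):
        """Register an inflected form after checking its switches."""
        for name, value in switches.items():
            if name not in allowed or value not in allowed[name]:
                raise ValueError(f"unknown switch {name}:{value}")
        self.forms[form] = dict(switches)


asua = Entry(base="asua", word_type="verb")
asua.add_form("asun", {"person": "first", "count": "singular"})
asua.add_form("asumme", {"person": "first", "count": "plural"})
```

Keeping the allowed switches in a separate table is what would let each language define its own set without touching the entry code.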

Furthermore, it should support at least basic colloquial versions. For example "mä" as "minä", "sä" as "sinä", "onx" as "on[ko] se" and so forth. This feature must recognize that a single colloquial word may expand into multiple words, like "onx". This feature is pretty low-priority though, in case it becomes a serious problem to implement it well.
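
A rough sketch of the colloquial lookup, again in Python just for illustration. The table and the function name `expand_colloquial` are hypothetical, and I've simplified "on[ko] se" to the two words "onko" and "se".

```python
# Hypothetical table: colloquial word -> list of standard words.
# A list is used because one colloquial word may expand into several.
COLLOQUIAL = {
    "mä": ["minä"],
    "sä": ["sinä"],
    "onx": ["onko", "se"],  # simplified from "on[ko] se"
}


def expand_colloquial(tokens):
    """Return (standard_word, was_colloquial) pairs for a list of tokens."""
    result = []
    for token in tokens:
        replacement = COLLOQUIAL.get(token.lower())
        if replacement is None:
            # Not a known colloquial form: pass the token through unchanged.
            result.append((token, False))
        else:
            result.extend((word, True) for word in replacement)
    return result
```

For example, `expand_colloquial(["onx", "kiva"])` gives three pairs: "onko" and "se" flagged as colloquial, and "kiva" passed through untouched.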

* Analyzer *
This is the part that I started almost a year ago. It will take Finnish text as input and analyze it according to the definitions in the dictionary. It will go word by word and try to find out what the native form of the word is, along with its type (noun, verb, adjective and so forth) and switches.

"Minä olen Helgi ja mä asun Turussa".

"minä [type:pronoun] [person:first] [count:singular]", "olla [type:verb] [person:first] [count:singular]", "Helgi [name or unknown]", "ja [connector]", "minä [type:pronoun] [special:colloquial] [person:first] [count:singular]", "asua [type:verb] [person:first] [count:singular]" and so forth.

I hope you understand the abovementioned example. You might have to read it a couple of times to fully grasp what I mean.
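
If it helps, here's a minimal sketch of the word-by-word lookup in Python (illustration only; the real thing would be PHP backed by the dictionary database). The `FORMS` table and the `analyze` function are hypothetical, and the table only holds what this one sentence needs. It uses "olla" as the dictionary form of "olen" and treats "minä" as a pronoun.

```python
import string

# Hypothetical lookup table: inflected form -> (native form, switches).
FORMS = {
    "minä": ("minä", ["type:pronoun", "person:first", "count:singular"]),
    "mä": ("minä", ["type:pronoun", "special:colloquial",
                    "person:first", "count:singular"]),
    "olen": ("olla", ["type:verb", "person:first", "count:singular"]),
    "asun": ("asua", ["type:verb", "person:first", "count:singular"]),
    "ja": ("ja", ["connector"]),
}


def analyze(text):
    """Return a (native_form, switches) pair for every word in the text."""
    results = []
    for raw in text.split():
        word = raw.strip(string.punctuation)
        if not word:
            continue
        # Words missing from the dictionary are kept as-is and flagged.
        base, switches = FORMS.get(word.lower(), (word, ["name or unknown"]))
        results.append((base, list(switches)))
    return results


analysis = analyze("Minä olen Helgi ja mä asun Turussa")
```

Here "Helgi" and "Turussa" come back flagged as "name or unknown" simply because the toy table has no entry for them.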

The app will then present the text in four ways: a direct translation (which will generally not make much sense, but that's okay), a translation where the words are turned into their native form, a color-coded version where words are colored by type, and a detailed analysis listing all known properties of each word. Working through text this way, I've found, is the fastest way to understand complex Finnish sentences, even though, ironically, it's also the most time-consuming and takes the most effort. It's "fast" in the sense that you learn things quite well that way, which reduces the need for old-school practice.
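
The color-coded version could be as simple as wrapping each word in a colored HTML span. A quick Python sketch, where the color choices and the names `TYPE_COLORS` and `colorize` are just placeholders:

```python
from html import escape

# Hypothetical color scheme: word type -> CSS color name.
TYPE_COLORS = {"verb": "red", "noun": "blue", "pronoun": "green"}


def colorize(words):
    """Turn (word, word_type) pairs into a line of colored HTML spans."""
    parts = []
    for word, word_type in words:
        color = TYPE_COLORS.get(word_type, "gray")  # gray for unknown types
        parts.append(
            f'<span style="color:{color}" title="{escape(word_type)}">'
            f"{escape(word)}</span>"
        )
    return " ".join(parts)
```

The `title` attribute makes the word type show up when hovering over a word, which is a cheap stand-in for the detailed view.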

* Trainer *
This will be similar in concept to FinnishSchools, except that users will be able to put in examples, just like in the other components. The user is presented with a word and a few options and must pick the correct one. This is probably by far the easiest part of the project.
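
A quick Python sketch of one multiple-choice question, just to show how little there is to it. The `FORMS` table and `make_question` are hypothetical; a real trainer would pull its words from the dictionary component.

```python
import random

# Hypothetical sample data: inflected form of "asua" -> description.
FORMS = {
    "asun": "first-person singular",
    "asut": "second-person singular",
    "asumme": "first-person plural",
    "asuvat": "third-person plural",
}


def make_question(forms, rng, n_options=3):
    """Pick a word and build a shuffled list of answer options."""
    word = rng.choice(sorted(forms))
    correct = forms[word]
    wrong = sorted({label for label in forms.values() if label != correct})
    options = rng.sample(wrong, min(n_options - 1, len(wrong))) + [correct]
    rng.shuffle(options)
    return word, options, correct


word, options, correct = make_question(FORMS, random.Random(0))
```

Passing in the random generator keeps the function easy to test, since a fixed seed gives the same question every time.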

* Last but not least... *
I need to know if there are any programmers here who are interested in this project. From years of experience with open-source development, I should note that it will take off slowly, and only after a very careful planning phase. If you're interested, you will have to read the specifications that I will write once I've received some input from y'all.

If you're interested but don't think you'll have a lot of time, that's okay. Things will just have to take the time they take, as with software development in general. We're in no hurry; there's no deadline.

And now, I'll let you flame this whole thing.

If something's unclear, just ask for an explanation and I'll explain it as well as I can.

helgihg · Post by **helgihg** » Sun Apr 22, 2007 3:59 pm

I'm putting this in a reply because it's really a different topic.

Hosting: I kinda like SourceForge, but a few of my friends think Google Code is better. I honestly don't know. Any input on good open-source project hosting would be great.

Furthermore, if this actually becomes reality, I'm sure we can get those bloodsuckers in congress to offer hosting later on and possibly advertising, encouraging the oh-so-valuable user input, that is to say, the actual information on words.

Finland Forum

Find information about moving to, living in and life in Finland