Right now I'm building a language centre DB. It's going to hold translations for data in tables in another DB (the English DB). The idea is that there will be a table in the Language DB for every language and table it translates in the English DB.
So lets consider the following in the English DB:
PROJ_TBL_HELPTOPICS
-> PK_HELP_ID
-> TITLE
-> DESCR
PROJ_TBL_CATEGORIES
-> PK_CAT_ID
-> TITLE
-> DESCR
In the Language DB I want to hold translations for HELPTOPICS and CATEGORIES, and I want translations for Spanish and Japanese.
PROJ_TBL_HELPTOPICS_ES
-> PK_TRANS_ID
-> FK_HELP_ID
-> TRANS_TITLE
-> TRANS_DESCR
The rest follow the same layout as above:
PROJ_TBL_HELPTOPICS_JA
PROJ_TBL_CATEGORIES_ES
PROJ_TBL_CATEGORIES_JA
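As a sketch, one of these per-language tables might be declared like this (column types and lengths are my assumptions; note that the FK column can only refer to the English DB informally, since SQL Server cannot enforce a foreign key across databases):

```sql
-- Sketch of the proposed per-language translation table (types are assumptions).
CREATE TABLE PROJ_TBL_HELPTOPICS_ES (
    PK_TRANS_ID INT IDENTITY(1,1) PRIMARY KEY,
    FK_HELP_ID  INT NOT NULL,  -- refers to EnglishDB..PROJ_TBL_HELPTOPICS.PK_HELP_ID;
                               -- not enforceable as an FK across databases
    TRANS_TITLE NVARCHAR(200) NOT NULL,
    TRANS_DESCR NVARCHAR(MAX) NOT NULL
);
```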
The reasons I separated the Language DB from the English DB are:
1. The English DB has, and is going to have, a lot more tables, and is going to be heavily queried (plus I don't think the translations are going to be used anywhere near as often as the English). I figured the fewer tables, where possible, the better.
2. By putting translations in a different DB, I could take better advantage of collations specific to a language, for example when using full-text searching on Japanese text.
Anyway, here's my question!
I want to link the foreign key column to the primary key column of the table it is translating (in the English DB). I want to be able to take advantage of ON DELETE CASCADE, so that when an item is deleted from EnglishDB.PROJ_TBL_HELPTOPICS it is also deleted from LanguageDB.PROJ_TBL_HELPTOPICS_[LANG ISO]. Is this done through mirroring?
Maybe this is a dumb question, but why don't you make the language a column on the table instead of having n tables? What you are doing doesn't seem necessary.
PROJ_TBL_HELPTOPICS
PK_HELP_ID
PK_HELP_LANG
TITLE
DESCRIPTION (why abbreviate?)
(don't you need a FK to the category?)
PROJ_TBL_CATEGORIES
PK_CAT_ID
PK_CAT_LANG
TITLE
DESCRIPTION
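A sketch of that single-table design with a composite key (types and lengths are my assumptions):

```sql
-- One table per entity; language is part of the composite primary key.
CREATE TABLE PROJ_TBL_HELPTOPICS (
    PK_HELP_ID   INT           NOT NULL,
    PK_HELP_LANG CHAR(2)       NOT NULL,  -- ISO code, e.g. 'en', 'es', 'ja'
    TITLE        NVARCHAR(200) NOT NULL,
    DESCRIPTION  NVARCHAR(MAX) NOT NULL,
    CONSTRAINT PK_HELPTOPICS PRIMARY KEY (PK_HELP_ID, PK_HELP_LANG)
);
```

Swapping the key order to (PK_HELP_LANG, PK_HELP_ID) changes which lookups the index favours; which is better depends on the query pattern.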
Unless you plan on putting the different language databases on completely different servers, I don't see the performance benefit in splitting it up. I do, however, see lots of headaches occurring as a result of de-normalising your data structure. Not to mention your data access code: instead of varying by only one parameter (the lang, i.e. "en"), you instead need to write code that accesses different databases and different tables (by name). That's just nasty.
Having the lang as part of the PK means the lang (and id) will be indexed, so access should be very fast. It may be best to put the LANG column before the id column; it depends on how you use it.
I dunno about your point about collations. Based on my limited understanding, that may be relevant to simple stringA = stringB comparisons. That said, Japanese doesn't have things like different case, so it's not really relevant... IMHO.
And in regards to your question, I dunno if FKs can work across separate DBs. My guess is NO!! If it were possible, you'd need to create a cascading-delete FK from the English help topics table to every other language database's help_topics table.
|||Remember, the point of this forum is not to trash one another; we're supposed to be giving helpful hints and tips.
My reason for different table per language
I want the ability to store the last 10 versions of a translation. So when translations are modified, I have the ability to go back through the last 10 versions of the same translation. In the past I've found that when I have a number of different translators come in, they have a tendency to change each other's translations, and I want to be able to undo a translation and backtrack up to 10 versions of that same translation. Keeping that in mind, I was a little reluctant to throw all of that into a single LanguageDB.PROJ_TBL_HELPTOPICS table with a LANG_ISO column, as the size of the table could potentially grow really big and hurt performance in the long run.
Right now EnglishDB.PROJ_TBL_HELPTOPICS has 20,000 rows; if I have 5 languages and the ability to keep up to 10 versions of a translation, that means LanguageDB.PROJ_TBL_HELPTOPICS could potentially grow up to 1,000,000 rows. I figured, if I split it up into language-specific tables (i.e. PROJ_TBL_HELPTOPICS_[LANG ISO]), then I could limit each table's size to only 10x that of the table it is translating; which I figured would help, as this is going to be a web application with reasonably heavily queried tables.
My reason for Column abbreviation
This is a safety precaution; let's say down the line I move to a scenario where I feel comfortable putting all of this in a single DB (e.g. migrating to Oracle): then I'll be able to tell the difference between a translation column and an English column when I'm doing a JOIN in a query.
Remote Foreign Key Constraints
I know there is a way of doing it through linked servers; I just wanted feedback from experienced DBAs on whether that is the right way to go about it, as they might have real-world examples where it worked, or didn't.
Conclusion
I'm trying to design my database for performance, as this has the potential of becoming a web project with a huge user base, sorting, storing and updating an enormous amount of data. Of course, later on I'm going to hire proper DBAs to tweak the DBs for optimum performance, but I'm trying to do as much as I can from the get-go. I could be doing this wrong, but that's the reason why I'm posting it up here BEFORE I get started on it.
So, as I asked before, is DB mirroring / linking the way to go about this? Does anyone have any scenarios where they've used mirroring or linking?
|||My apologies for the trash Tim, I've removed it.
Tim Dilbert:
My reason for different table per language
I want the ability to store the last 10 versions of a translation. So when translations are modified, I have the ability to go back through the last 10 versions of the same translation. In the past I've found that when I have a number of different translators come in, they have a tendency to change each other's translations, and I want to be able to undo a translation and backtrack up to 10 versions of that same translation. Keeping that in mind, I was a little reluctant to throw all of that into a single LanguageDB.PROJ_TBL_HELPTOPICS table with a LANG_ISO column, as the size of the table could potentially grow really big and hurt performance in the long run.
I hear what you're saying: you don't want a massive table. You could create a historical table; create another table called LanguageDB.PROJ_TBL_HELPTOPICS_HISTORY with the same schema as LanguageDB.PROJ_TBL_HELPTOPICS. Remove the identity property from the identity field and add a new field called DATECREATED. Then every time you save a new record to LanguageDB.PROJ_TBL_HELPTOPICS you copy the old row to LanguageDB.PROJ_TBL_HELPTOPICS_HISTORY. You can then easily get the last X historical records by filtering by ID & LANG and ordering by DATECREATED DESC. You could copy the old data to the HISTORY table by means of a TRIGGER (having only one table makes this easy to manage; I hate lots of triggers) OR you could just copy from your application when you save.
This approach keeps your primary tables lean and also gives you a complete history.
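A minimal sketch of the trigger variant of this idea (table and column names are assumptions following the thread's naming; the history table is assumed to mirror the main table plus DATECREATED):

```sql
-- Copy the outgoing row image into the history table whenever a
-- translation is updated, using the "deleted" pseudo-table.
CREATE TRIGGER TRG_HELPTOPICS_HISTORY
ON PROJ_TBL_HELPTOPICS
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO PROJ_TBL_HELPTOPICS_HISTORY
        (PK_HELP_ID, PK_HELP_LANG, TITLE, DESCRIPTION, DATECREATED)
    SELECT PK_HELP_ID, PK_HELP_LANG, TITLE, DESCRIPTION, GETDATE()
    FROM deleted;  -- pre-update row images
END;
```

The last 10 versions of a given topic and language are then a simple `SELECT TOP 10 ... WHERE PK_HELP_ID = @id AND PK_HELP_LANG = @lang ORDER BY DATECREATED DESC`.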
Tim Dilbert:
My reason for Column abbreviation
This is a safety precaution; let's say down the line I move to a scenario where I feel comfortable putting all of this in a single DB (e.g. migrating to Oracle): then I'll be able to tell the difference between a translation column and an English column when I'm doing a JOIN in a query.
I'm not sure I understand that, but to each his own :) What DB are you currently using? I was assuming MS SQL.
Tim Dilbert:
Remote Foreign Key Constraints
I know there is a way of doing it through linked servers; I just wanted feedback from experienced DBAs on whether that is the right way to go about it, as they might have real-world examples where it worked, or didn't.
Well, I'm not an experienced DBA. But I think it's indisputable that, given your goal is to improve performance, running your application through linked servers is probably going to negate any performance benefit of smaller tables.
Tim Dilbert:
So, as I asked before, is DB mirroring / linking the way to go about this? Does anyone have any scenarios where they've used mirroring or linking?
Well, if your heart's set on some kind of distributed architecture, have you considered a SQL cluster?
|||
Tim Dilbert:
Remember, the point of this forum is not to trash one another; we're supposed to be giving helpful hints and tips.
...
I could be doing this wrong, but that's the reason why I'm posting it up here BEFORE i get started on it.
I think you are being perhaps a little over-sensitive; Sam Critchley's suggestions could save you a lot of grief in the future.
I do agree that this forum is not for trashing each other, but what do you do when your friend strikes a match to look for a gas leak? You say, here is a helpful hint or tip; don't do that, it's a bad idea, you'll regret it later.
Having tables for each language could get expensive, especially if you add a language or six. Adding a language would require code changes, an ideal opportunity to introduce a few bugs every now and again; always gives the users something to laugh about. Even other changes that didn't involve adding a language would need to be made in several places instead of one.
I shudder at the idea of splitting the data not only into separate tables but separate databases. That is another opportunity for something to go wrong, for versions to get out of step with each other. Murphy's law has not yet been repealed.
Your reasons for splitting up the database seem to be about performance, not to have too many records, not to have too many tables.
Having big tables is not a problem for a proper industrial-strength database. You said "potentially grow up to 1,000,000 rows" as if that was a lot. No. I have a database with over 200 tables, some of them with 20-30 million rows. One copy lives happily on my laptop for a current project; another copy supports 200 concurrent users actively updating on a clapped-out 12-year-old machine with 192 MB of memory.
However, there is the point about using different collation for different languages. I really don't know how much difference that would make, but if it was going to be significant, that might be a good reason for splitting into separate tables.
And if you were going to have separate web sites for each language, that might be a justification for splitting into separate databases.
BUT
I would want to keep all the data for all languages in a single central database where you do all the updating and maintenance, and periodically extract data for a particular language, to refresh the separate language database. That would be a one-way flow, data would go out to the satellite databases, but no changes would be made there, and no data would come back. The separate single language database should have the same structure, the same table names and column names, as the master database, so that you only need to maintain one version of the code.
I think you should have a master table of languages, and language keys in the translation tables, all languages in the same tables. Your searches would include language as a parameter, and even a full-text search (and you should try to avoid them as much as possible) would only look at records with a particular language.
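A sketch of that arrangement (table and column names are my assumptions, following the thread's naming): a master language table, with language passed as a parameter to every query.

```sql
-- Master list of supported languages.
CREATE TABLE TBL_LANGS (
    LANG_ISO  CHAR(2)      NOT NULL PRIMARY KEY,  -- e.g. 'en', 'es', 'ja'
    LANG_NAME NVARCHAR(50) NOT NULL
);

-- All translations live in one table; every search filters by language,
-- so the language is the only thing that varies in the data access code.
SELECT TITLE, DESCRIPTION
FROM   PROJ_TBL_HELPTOPICS
WHERE  PK_HELP_LANG = @lang
AND    PK_HELP_ID   = @id;
```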
As Sam pointed out, you have a category table, but nothing else connects back to it? What is it for?
Have you given any thought to how you would identify which records were the 10 different translations of the same topic, and which was the latest one?
Should there be a version number?
Should there be a date added, maybe a date obsolete/superseded? I tend to add a date/time last updated and a UserId_last_updated_by too.
Might you want to store the name of the translator, or a key to link to a translators table?
You didn't say how much text there is in each topic: minimum, maximum, average? A line, a page, a fifty-page article, War and Peace? That's a very important design consideration. If more than a paragraph or two, I would consider separating the main text into one table, and a bunch of keywords, search terms, categories, or whatever, maybe a summary, into another table. I would encourage users to search the summary and keywords rather than the full text, because it will be quicker; while still allowing the option to search the full text, but requiring another step, a conscious effort rather than the default.
It would be a good idea to structure the data so that you only search the current versions, so a current/obsolete flag could be in an index with the language.
Alternatively, keep obsolete stuff out of the way by moving it into an archive table, so it is out of the way when searching.
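One way to sketch the flag-plus-index idea (column and index names are assumptions):

```sql
-- Mark each row current or obsolete, and index language + currency together
-- so searches over the current rows of one language stay narrow.
ALTER TABLE PROJ_TBL_HELPTOPICS
    ADD IS_CURRENT BIT NOT NULL DEFAULT 1;

CREATE INDEX IX_HELPTOPICS_LANG_CURRENT
    ON PROJ_TBL_HELPTOPICS (PK_HELP_LANG, IS_CURRENT)
    INCLUDE (TITLE);  -- covering column for common lookups
```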
Optionally, don't save the full text of each version of the translation, save only one full text version, and the changes that need to be applied to get back to each of the other versions. Like a source code version system, in fact. You'd want to keep the latest version online in full, keep the current version in the main table, with a separate table for change history. That would keep the overall size down a bit.
Well, that's my tuppence worth.
Good luck with your project.
|||Alright, sorry for taking so long to get back but I've done some extensive research on this and found out the answers.
It seems my idea of splitting data into language-specific tables is the right thing to do. In SQL Server 2005, you can set collations at the server level, database level and column level. If I were to put all translations into a central table referencing languages through a "lang" column, I wouldn't be able to specify a collation (e.g. Chinese_Hong_Kong_90_xxxx, French_CS_AS, etc.), since a per-row collation is not available.
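For example, a per-language table can declare its text columns in a collation matching the language (a sketch; Japanese_CI_AS is a real SQL Server collation, while types and lengths are my assumptions):

```sql
-- Column-level collation: each language-specific table declares its
-- text columns in the collation that matches that language.
CREATE TABLE PROJ_TBL_HELPTOPICS_JA (
    PK_TRANS_ID INT IDENTITY(1,1) PRIMARY KEY,
    FK_HELP_ID  INT NOT NULL,
    TRANS_TITLE NVARCHAR(200) COLLATE Japanese_CI_AS NOT NULL,
    TRANS_DESCR NVARCHAR(MAX) COLLATE Japanese_CI_AS NOT NULL
);
```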
Collations & Full Text
When building a multilingual application, if you want to fully take advantage of the features that SQL Server has to offer, you should be using a collation specific to that language so the appropriate word breakers, stemmers, and iFilters are used at index and query time. You can specify which language you'd like to use when creating your full-text catalogue (though it doesn't say so in the article I'm linking here). I did read elsewhere that it is still a good idea to use the proper collation so your words are stemmed properly. For example, in German "Häuser" (houses) would also be stemmed as "haeuser" (houses) and "haus" (house) with a German collation and a German full-text catalogue; however, I think it only gets stemmed as "Häuser" (houses) or "haeuser" (houses) with Latin1_General_CI_AS but a German full-text catalogue (but I COULD BE WRONG).
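A sketch of declaring the language on a full-text index (the table name follows the thread's naming convention; the catalogue and key index names are assumptions):

```sql
-- Full-text catalogue plus an index whose columns use the German word
-- breakers and stemmers at both index and query time.
CREATE FULLTEXT CATALOG FT_German;

CREATE FULLTEXT INDEX ON PROJ_TBL_HELPTOPICS_DE
    (TRANS_TITLE LANGUAGE 'German',
     TRANS_DESCR LANGUAGE 'German')
    KEY INDEX PK_HELPTOPICS_DE   -- unique key index on the table (assumed name)
    ON FT_German;
```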
Conclusion
After everything, I am liking the suggestions that LockH made; putting the older versions of translations in another table is a really good idea. That's going to save me a hell of a lot of headaches later on. Translations in the archived table will have two timestamps on them: the date the translation was entered into the DB, and the date it was replaced. The new translation is going to go into LanguageDB.TBL_HELP_TOPICS_TCH, updating the row that's in there.
Linked Server Overhead
Now, as far as the overhead of using linked servers: I think you all misunderstood me when I said I want to use linked servers. I'm not talking about doing SELECTs through a linked server; the link would only be used in a trigger when a delete is done, so it cascades down to the translations attached to the help topic. In SQL Server you can dynamically concatenate a query string, then use 'EXECUTE'.
-- so
EXECUTE ('SELECT * FROM TBL_LANGS');
-- that will execute the same as this
SELECT * FROM TBL_LANGS;
Keeping that in mind, I was able to make a stored procedure which is attached to an 'after delete' trigger on each of the tables, passing the table name and the PK of the deleted row; the stored procedure loops through all the languages in the EnglishDB.TBL_LANGS table, deleting the rows in each LanguageDB.TBL_HELP_TOPICS_[LANG ISO] table related to the deleted row. Since this is only going to be done on deletes, the overhead is minimal considering how helpful it is going to be for managing the data in LanguageDB.
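A hedged sketch of such a procedure (table, column, and parameter names follow the thread's naming but are assumptions; if LanguageDB lives on a linked server, the dynamic SQL would need a four-part name):

```sql
-- Delete the matching translation row from every per-language table.
-- Intended to be called from an AFTER DELETE trigger with the base
-- table name and the PK of the deleted row.
CREATE PROCEDURE SP_DELETE_TRANSLATIONS
    @TableName SYSNAME,
    @HelpId    INT
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @LangIso CHAR(2), @Sql NVARCHAR(500);

    DECLARE lang_cursor CURSOR FOR SELECT LANG_ISO FROM TBL_LANGS;
    OPEN lang_cursor;
    FETCH NEXT FROM lang_cursor INTO @LangIso;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- Build the per-language delete dynamically from the table name
        -- and the ISO code suffix.
        SET @Sql = N'DELETE FROM LanguageDB.dbo.'
                 + QUOTENAME(@TableName + '_' + @LangIso)
                 + N' WHERE FK_HELP_ID = @Id';
        EXEC sp_executesql @Sql, N'@Id INT', @Id = @HelpId;
        FETCH NEXT FROM lang_cursor INTO @LangIso;
    END;

    CLOSE lang_cursor;
    DEALLOCATE lang_cursor;
END;
```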
In EnglishDB.TBL_LANGS there is a column which tells C# which DB to connect to to get the translation from. Since you open and close connections for every query you make to SQL Server, there is no extra overhead other than the extra query (which I think is unavoidable, because you'd be making that call anyway whether all the translations are in one database or not).
Anyway, here are the articles I read, which are actually amazing. I think you should read them in their entirety before you even get started designing any DB for multilingual content.
http://www.simple-talk.com/sql/learn-sql-server/sql-server-full-text-search-language-features/ - Part 1
http://www.simple-talk.com/sql/learn-sql-server/sql-server-full-text-search-language-features,-part-2/ - Part 2
I wish more people would reply to this thread. It'd be good if there were some suggestions for which collation is best to use for which language. I haven't really been able to find anyone with real-world examples.
|||
After everything, I am liking the suggestions that LockH made; putting the older versions of translations in another table is a really good idea.
Errm... I actually suggested that, with the timestamp (second post, first paragraph).
Well good luck.