|
Hello friends! How is the day going? I need opinions on this one. I'm working on a social network site and, as you all know, there are many modules involved, e.g. friends, chat, messages, etc. My question is: should I create a database for each one (e.g. a chat db, messages db, friends db), or should I create one database and then a table for each of them (e.g. tblchat, tblmessages)? Which of these options is better for managing a social network?
|
|
|
|
|
While possible, I expect that any of those options will prove equally challenging to you.
|
|
|
|
|
You need to decide where one set of records needs a connection or relationship to another. If they have a relationship then it makes sense for them to be in separate tables of a single database.
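For example (just a sketch, with made-up table and column names rather than anything from your project), the members and messages modules would normally end up as related tables in the same database:
CREATE TABLE users (
    user_id   INT PRIMARY KEY,
    user_name VARCHAR(100) NOT NULL
);
CREATE TABLE messages (
    message_id   INT PRIMARY KEY,
    sender_id    INT NOT NULL,
    recipient_id INT NOT NULL,
    body         TEXT,
    FOREIGN KEY (sender_id)    REFERENCES users (user_id),  -- relationship back to users
    FOREIGN KEY (recipient_id) REFERENCES users (user_id)
);
Split those into separate databases and the engine can no longer enforce that relationship for you; in most products cross-database foreign keys are either awkward or impossible.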
|
|
|
|
|
If you are expecting extreme volumes you should investigate the storage methods specifically for that environment. Otherwise have them all reside in the same database.
If you need separation then use schemas; this will make it easier to split the database in the future if your client base grows significantly.
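A rough sketch (PostgreSQL / SQL Server style syntax; the schema and table names are only illustrative):
CREATE SCHEMA chat;
CREATE SCHEMA messaging;

CREATE TABLE chat.sessions (
    session_id INT PRIMARY KEY,
    started_at TIMESTAMP NOT NULL
);

CREATE TABLE messaging.messages (
    message_id INT PRIMARY KEY,
    body       TEXT
);
Everything still lives in one database, but each module has its own namespace, which keeps backups and permissions simple and leaves the door open to moving a schema out later.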
Never underestimate the power of human stupidity
RAH
|
|
|
|
|
As already answered, the main criterion would be: is there a relation between the data? If there is, a single database is the most logical choice.
However, there are also other things to consider, such as maintenance, backups and so on. So even if the data isn't related, from a maintenance point of view it could simplify things if all the data is in one place. Of course, if the amount of data grows huge over time you may have to separate it, but in the beginning it makes sense to start small.
|
|
|
|
|
Hello friends! How is the day going? I need opinions in this case. I'm working on a social network site and as you all know there are many modules involved e.g friends, chat, messages table etc. My question is, shoulf
|
|
|
|
|
Where did your question go?
There are only 10 types of people in the world, those who understand binary and those who don't.
|
|
|
|
|
I'm using a PostgreSQL database and managing it becomes harder and harder due to its sheer size.
The largest table contains about 2.5-3 billion records and there are some tables reaching about 10 million records. The database is about 1.2 TB on disk. I believe the large table (and the tables around it) is the most problematic, since there are (very roughly) 10 million write statements a day.
I have had issues with the transaction IDs wrapping around. Luckily by now I know the cure, but I would like to avoid this issue (and any other issue!) if possible, especially since the database is expected to grow even further. VACUUM takes forever on that large table.
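(For anyone searching for the same thing later: the wraparound headroom can be checked from the standard catalogs with a query like this:)
-- How far each database is from transaction-ID wraparound; autovacuum starts
-- forced freezing once xid_age approaches autovacuum_freeze_max_age (200 million by default)
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;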
I did a search on managing large databases, but most results are for other database products. I only found a not-so-helpful presentation on SlideShare.
If anyone has good (practical) tips about keeping such a large database into good shape, please let me know.
(I hope I explained things well enough)
|
|
|
|
|
With that amount of data, I take it you already know enough about indexing etc., so let's not go into that.
Depending on what problem you're solving, the technique might be different. Based on your description I'd concentrate on partitioning the large table (or tables) first. Partitioning would help you isolate the hot spot of the table to a significantly smaller set of rows, so that changes in the indexes and on disk affect a considerably smaller amount of data.
For more information, have a look at http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html[^]
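As a rough sketch of the inheritance-based approach described on that page (the table and column names below are just placeholders for your timeline data):
-- Parent table that the application keeps querying
CREATE TABLE samples (
    sample_time TIMESTAMPTZ NOT NULL,
    value       DOUBLE PRECISION
);

-- One child per month; the CHECK constraint lets the planner skip partitions
-- that cannot contain the requested time range (with constraint_exclusion enabled)
CREATE TABLE samples_2015_07 (
    CHECK (sample_time >= DATE '2015-07-01' AND sample_time < DATE '2015-08-01')
) INHERITS (samples);

CREATE INDEX samples_2015_07_time_idx ON samples_2015_07 (sample_time);
Inserts are then routed to the current month's child, either via a trigger on the parent or by pointing the daily import directly at the right partition, and old partitions can be vacuumed, archived or dropped independently.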
As a side note, I hate to say this, but I would also consider changing the database product.
|
|
|
|
|
Oracle? I know there is a free Oracle version (and a free SQL Server version), but I'm not sure they'll do the job. Paying for it is a no-go here (it has to be as cheap as possible), so it's down to either MySQL or PostgreSQL, and I'll take the latter any day.
Indexing: I'm not really an expert, but the indexes are limited to 2 (excluding the implicit index on the primary key). So if you have tips there, please shoot.
For this application I'm the designer, architect, tester, developer and, you guessed it, DBA, despite my limited knowledge of being a 'real' DBA. (There is hardly anyone at our company who would do better, so no choice there.)
I'll have a look at partitioning.
[EDIT]Ooh, and thanks ! [/EDIT]
|
|
|
|
|
V. wrote: I know there is a free Oracle version (and a free SQL-Server version), but I'm not sure they'll do the job
The free one, no. It wouldn't be sufficient because of its limitations on database size etc. So you'd have to go for the paid one...
One interesting option could be https://mariadb.org/[^]. If you have time at some point, have a look.
|
|
|
|
|
I think your strategy on indexing is probably a good one; you can kill a database with too many indexes...
As Mika suggested, partitioning is going to be your best bet, but you should also look at archiving (a more drastic form of partitioning).
You should also look at just what you are getting out of all those records and what use they are being put to! Is it possible to create a highly summarised reporting database?
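Something along these lines, purely as a sketch with invented names:
-- Nightly job that rolls the detail up to one row per day
CREATE TABLE samples_daily_summary AS
SELECT CAST(sample_time AS DATE) AS sample_day,
       COUNT(*)                  AS sample_count,
       MIN(value)                AS min_value,
       MAX(value)                AS max_value,
       AVG(value)                AS avg_value
FROM   samples
GROUP  BY CAST(sample_time AS DATE);
Reports would then hit a few thousand rows instead of billions.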
Never underestimate the power of human stupidity
RAH
|
|
|
|
|
Unfortunately, all 3.5 billion records should be readily available. The reading part is not really the issue, since I limit the recordset to a maximum of 30,000 records through the client applications and there is a downsampling algorithm in place as well (it's timeline data: (timestamp, value) pairs). It's the writing part (transaction IDs, VACUUM, ...) that's problematic.
Yes, my boss had some highly demanding requirements.
|
|
|
|
|
V. wrote: all 3.5 billion records should be readily available
I think your boss has not thought through the requirements. Readily may not mean instantly.
Never underestimate the power of human stupidity
RAH
|
|
|
|
|
Instantly, immediately, blazing fast, ... whatever.
But reading is pretty OK. In fact, the product works like a charm for the users. I'm just having trouble importing data every day into this specific record set (because it is so huge).
Thanks for the advice; still simmering on it a little.
|
|
|
|
|
3.5 billion records aren't anything strange. It's the amount of data that changes that's the problem.
And since you also don't have any problems selecting the data, we have to look at the other side of the problem.
Firstly, we have to consider the fact that insert time is close to linear[^] in the number of indexes, so keeping that number low is important.
But it also depends very non-linearly on some other factors.
So I have some questions:
Do you have any unused surrogate keys?
Are you inserting data in parallel or from more than one process at a time?
Are you inserting data at one end of the index only (using an ID, sequence or date, for example)?
Are you just inserting, or are you updating and deleting at the same time as well?
|
|
|
|
|
I have worked with 300 million row medical datasets where analysis jobs took two days to run.
A lot can depend on how the data is modelled. Sometimes a very flat table structure can help, where in effect you denormalise the data; this is a very useful way of modelling the data if you are just trawling through one table. At other times, more EAV-style tables can speed things up.
A lot of it depends on what you are doing with the data; there is no simple answer as to how to structure or restructure data, as much of it comes down to experience and trial and error.
One thing you may want to consider when running queries on the data: have an initial set of queries that extract the general population of data you are working on. Then build tables with indexes on this 'candidate' data and work from there. This way you may be slimming your data down from 3 billion rows to a few million rows.
You probably do not want to be running queries against a 3-billion-row recordset other than for the purpose of extracting your initial data, which you then work on in separate tables.
This is how I did things with 300 million rows, although I realise that this is smaller by a factor of 10 compared to your dataset.
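In sketch form, with entirely hypothetical names:
-- Pull the working population out once...
CREATE TABLE candidate_samples AS
SELECT *
FROM   samples
WHERE  sample_time >= DATE '2015-01-01'
  AND  sample_time <  DATE '2015-07-01';

-- ...index it, and run the actual analysis against this much smaller table
CREATE INDEX candidate_samples_time_idx ON candidate_samples (sample_time);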
[edit] I noticed you also mention in a later post that writing is an issue.
Your limitation is going to come down to hardware. Depending on your budget, splashing out on a decent SAN solution may help, although SAN technology is beyond my current level of experience; I am just someone who uses it while other people configure it.
Be sure that the disks where the database lives are not shared with the OS or anything else that may cause contention when reading/writing data.
“That which can be asserted without evidence, can be dismissed without evidence.”
― Christopher Hitchens
modified 18-Jul-15 7:31am.
|
|
|
|
|
GuyThiebaut wrote: Be sure that the disks where the database lives are not shared with the OS or anything else that may cause contention when reading/writing data.
Cool, currently it is still shared, but I'm getting bigger disks this summer on another machine, so that should fix that.
I'm actually happy that it looks 'normal' to struggle with this; we have some developers here (well, doctors in mathematics and such) who react like "how could that be a problem?", which annoys the hell out of me.
Still taking in advice and letting it simmer a bit, to see if I can get a solution out of it.
Thanks for the advice.
|
|
|
|
|
V. wrote: how could that be a problem?", which annoys the hell out of me
Unfortunately, as you are discovering, when it comes to these sorts of issues experience wins over technical proficiency.
V. wrote: (well, doctors in mathematics and such)
Sometimes one way to manage people's expectations when they have a science background is to explain the science of disk access and data storage. The chances are that, being scientists, they will understand and value a scientific explanation, and that way you can win them over to your side. I say this as someone who works in a research company where many of those in the company have PhDs; 'the database is slow' is something I always hear from them.
I think many people still forget that computers have moving parts; they are machines, not some sort of quantum flux probability engine that returns results at the speed of light.
3 billion records is a huge dataset. Even CERN has intermediary machines that filter out data on the go as experiments are run (worth checking this out on YouTube), as so much data is generated that they could not possibly store or analyse it all.
Good luck and be sure to run tests before and after so that you can prove that there has been a speed up.
“That which can be asserted without evidence, can be dismissed without evidence.”
― Christopher Hitchens
|
|
|
|
|
I know how to update a single record, but updating multiple records is different.
Single
Dim uTemplate As CRM_JOBS = context.crm_Jobs.Single(Function(m) m.templateName = templateName)
uTemplate.templateXHTML = sEMT.templateXHTML
uTemplate.templateID = pValue
context.SaveChanges()
I thought of this for multiple records, but items doesn't carry over to the loop.
So would I have to project it into something and then save?
Dim jobs =
From items In context.crm_Jobs
Where items.templateName = templateName
For Each items In jobs
jobs.templateXHTML = sEMT.templateXHTML
jobs.templateID = pValue
Next
context.SaveChanges()
[edit]
I have this now
Dim jobs =
From j In context.crm_Jobs
Where j.templateName = templateName
Select j
For Each j As CRM_JOBS In jobs
j.templateXHTML = sEMT.templateXHTML
j.templateID = sEMT.templateID
Next
context.SaveChanges()
Oh that works!
modified 15-Jul-15 17:32pm.
|
|
|
|
|
Hi,
I have the code below:
WHILE i < total_rows DO
SET @param_employee_number = (SELECT employee_number FROM earned_leaves LIMIT i,1);
PREPARE query_statement FROM 'CALL sp_populate_leave_summary(?)';
EXECUTE query_statement USING @param_employee_number;
SET i = i + 1;
END WHILE;
I want to save the value returned by the EXECUTE into a variable in order to use it in the next UPDATE statement.
How can I do this please?
Thanks,
Jassim[^]
Technology News @ www.JassimRahma.com
|
|
|
|
|
Given that this is MySQL, you could modify the stored procedure to have an OUT parameter. In that case you could then use two variables in your EXECUTE. So something like:
PREPARE query_statement FROM 'CALL sp_populate_leave_summary(?,?)';
EXECUTE query_statement USING @param_employee_number, @another_parameter;
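On the procedure side that would mean something roughly like this (only a sketch: the body is omitted and the names are illustrative):
CREATE PROCEDURE sp_populate_leave_summary(
    IN  param_employee_number INT,
    OUT param_eligible_days   INT
)
BEGIN
    -- existing calculation here, ending with something like:
    SET param_eligible_days = 0;  -- placeholder for the real calculated value
END;
After the EXECUTE, the result is available in @another_parameter for use in your UPDATE.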
|
|
|
|
|
I am getting:
Procedure execution failed
1048 - Column 'earned_leave' cannot be null
although sp_populate_leave_summary returns the correct value, as I can see in the result window.
This is the code after adding an OUT parameter:
WHILE i < total_rows DO
SET @param_employee_number = (SELECT employee_number FROM earned_leaves LIMIT i,1);
PREPARE query_statement FROM 'CALL sp_populate_leave_summary(?,?)';
EXECUTE query_statement USING @param_employee_number, @param_eligible_days;
UPDATE earned_leaves SET earned_leave = @param_eligible_days WHERE employee_number = @param_employee_number;
SET i = i + 1;
END WHILE;
My OUT parameter is:
OUT param_eligible_days int
Technology News @ www.JassimRahma.com
|
|
|
|
|
As the error says, the value returned from your procedure into the variable @param_eligible_days is NULL. Because of this, NULL is being written to the table, which is prohibited because the column is defined as NOT NULL.
So check why sp_populate_leave_summary returns NULL for the second parameter.
|
|
|
|
|
Just noting that the posted code suggests a loop should not be used at all, so the question that derives from it is moot.
SQL (as most databases implement it) does queries and updates on sets; the loop is implicit. It is just a matter of structuring the statement.
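Purely as a hypothetical sketch, with the derived table standing in for whatever calculation sp_populate_leave_summary actually performs:
UPDATE earned_leaves AS el
JOIN (
    SELECT employee_number,
           SUM(days_earned) AS eligible_days   -- imagined calculation
    FROM   leave_transactions                  -- hypothetical source table
    GROUP  BY employee_number
) AS calc
  ON calc.employee_number = el.employee_number
SET el.earned_leave = calc.eligible_days;
One statement, no prepared-statement loop; the engine does the iteration internally.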
|
|
|
|