|
I'm using a PostgreSQL database, and managing it is becoming harder and harder due to its sheer size.
The largest table contains about 2.5-3 billion records, and there are some other tables reaching about 10 million records. The database is about 1.2 TB on disk. I believe the large table (and the tables around it) is the most problematic, since it receives (very roughly) 10 million write statements a day.
I have had issues with transaction ID wraparound. Luckily by now I know the cure, but I would like to avoid this issue (and any other issue!) if possible, especially since the database is expected to grow even further. VACUUM takes forever on that large table.
I did a search on managing large databases, but most results are for other database products; all I found was a not-so-helpful presentation on SlideShare.
If anyone has good (practical) tips on keeping such a large database in good shape, please let me know.
(I hope I explained things well enough)
|
|
|
|
|
With that amount of data, I take it you already know enough about indexing etc., so let's not go into that.
Depending on what problem you're solving, the technique might be different. Based on your description I'd concentrate on partitioning the large table (or tables) first. Partitioning would help you isolate the hot spot of the table to a significantly smaller number of rows, so that changes in the indexes and on disk would affect a considerably smaller amount of data.
For more information, have a look at http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html[^]
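To give a flavour of the inheritance-based approach in those docs, here is a minimal sketch assuming a hypothetical timeline table measurements(ts, value); real partition ranges, names and routing logic would of course follow your data:
-- Parent table; it never stores rows itself.
CREATE TABLE measurements (
    ts    timestamptz NOT NULL,
    value double precision
);
-- One child table per month, constrained to its date range.
CREATE TABLE measurements_2015_07 (
    CHECK (ts >= DATE '2015-07-01' AND ts < DATE '2015-08-01')
) INHERITS (measurements);
CREATE INDEX measurements_2015_07_ts_idx ON measurements_2015_07 (ts);
-- Route inserts on the parent to the matching child (one branch shown).
CREATE OR REPLACE FUNCTION measurements_insert_router() RETURNS trigger AS $$
BEGIN
    IF NEW.ts >= DATE '2015-07-01' AND NEW.ts < DATE '2015-08-01' THEN
        INSERT INTO measurements_2015_07 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'No partition for timestamp %', NEW.ts;
    END IF;
    RETURN NULL;  -- skip the insert into the parent
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER measurements_insert_trg
    BEFORE INSERT ON measurements
    FOR EACH ROW EXECUTE PROCEDURE measurements_insert_router();
With constraint_exclusion enabled, queries that filter on ts only touch the relevant children, and an old month can be dropped with a single DDL statement instead of a huge DELETE followed by VACUUM.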
As a side note, I hate to say this, but I would also consider changing the database product.
|
|
|
|
|
Oracle? I know there is a free Oracle version (and a free SQL-Server version), but I'm not sure they'll do the job. Paying for it is a no-go here (it has to be as cheap as possible), so I'm stuck with either MySQL or PostgreSQL, and I'll take the latter any day.
As for indexing, I'm not really an expert, but the indexes are limited to two (excluding the implicit index on the primary key), so if you have tips there, please shoot.
For this application I'm the designer, architect, tester, developer and, you guessed it, DBA, despite my limited knowledge of what a 'real' DBA does. (There's hardly anyone at our company who would do better, so no choice there.)
I'll have a look at partitioning.
[EDIT]Ooh, and thanks![/EDIT]
|
|
|
|
|
V. wrote: I know there is a free Oracle version (and a free SQL-Server version), but I'm not sure they'll do the job
The free one, no. It wouldn't be sufficient because of the limitations on database size etc., so you'd have to go for the paid one...
One interesting option could be https://mariadb.org/[^]. If you have time at some point, have a look.
|
|
|
|
|
I think your strategy on indexing is probably a good one; you can kill a database with too many indexes...
As Mika suggested, partitioning is going to be your best bet, but you should also look at archiving (a more drastic form of partitioning).
You should look at just what you are getting out of all those records and what use they are being put to! Is it possible to create a highly summarised reporting database?
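To make that concrete, something like the following (hypothetical table and column names, and assuming the detail rolls up along time; use whatever grouping your reports actually need) keeps reporting away from the multi-billion-row table:
-- One row per hour instead of one row per raw sample.
CREATE TABLE measurements_hourly AS
SELECT date_trunc('hour', ts) AS bucket,
       count(*)               AS samples,
       min(value)             AS min_value,
       max(value)             AS max_value,
       avg(value)             AS avg_value
FROM measurements
GROUP BY date_trunc('hour', ts);
CREATE INDEX measurements_hourly_bucket_idx ON measurements_hourly (bucket);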
Never underestimate the power of human stupidity
RAH
|
|
|
|
|
Unfortunately, all 3.5 billion records should be readily available. The reading part is not really the issue, since I limit the result set to a maximum of 30,000 records in the client applications and there is a downsampling algorithm in place as well (it's timeline data: (timestamp, value) pairs). It's the writing part (transaction IDs, VACUUM, ...) that's problematic.
Yes, my boss had some highly demanding requirements.
|
|
|
|
|
V. wrote: all 3.5 billion records should be readily available
I think your boss has not thought through the requirements. Readily may not mean instantly.
Never underestimate the power of human stupidity
RAH
|
|
|
|
|
Instantly, immediately, blazing fast... whatever.
But reading is pretty OK. In fact, the product works like a charm for the users. I'm just having trouble with importing data every day into this specific record set (because it is so huge).
Thanks for the advice; still simmering on it a little.
|
|
|
|
|
3.5 billion records is nothing strange in itself. It's the amount that changes that's the problem.
And since you also don't have any problems selecting the data, we have to look at the other side of the problem.
Firstly, we have to consider the fact that insert time is close to linear[^] with the number of indexes. So keeping that number low is important.
But it also depends very non-linearly on some other factors.
So I have some questions:
Do you have any unused surrogate keys?
Are you inserting data in parallel, or from more than one process at a time?
Are you inserting data at one end of the index only (using an ID, sequence or date, for example)?
Are you just inserting, or are you updating and deleting at the same time as well?
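On the point about keeping the number of indexes low: for a big daily import it can sometimes be cheaper to drop a secondary index, bulk-load, and rebuild it afterwards than to maintain it row by row. A rough sketch with hypothetical table, index and file names; note that on a table this size rebuilding an index is itself expensive, so this usually only makes sense per partition or for a genuinely large batch:
DROP INDEX IF EXISTS measurements_value_idx;
COPY measurements FROM '/path/to/daily_batch.csv' WITH (FORMAT csv);
CREATE INDEX measurements_value_idx ON measurements (value);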
|
|
|
|
|
I have worked with 300 million row medical datasets where analysis jobs took two days to run.
A lot can depend on how the data is modelled. Sometimes a very flat table structure, where in effect you denormalise the data, can help; this is a very useful way of modelling the data if you are just trawling through one table. At other times more EAV-style tables can speed things up.
A lot of it depends on what you are doing with the data; there is no simple answer as to how to structure or restructure data, as much of it comes down to experience and trial and error.
One thing you may want to consider when running queries on the data: have an initial set of queries that extract the general population of the data you are working on, then build indexed tables from this 'candidate' data and work from there. This way you may slim your data down from 3 billion rows to a few million rows.
You probably do not want to be running queries on a 3 billion-row record set other than for the purpose of extracting your initial data, which you then work on in separate tables.
This is how I did things with 300 million rows, although I realise that this is smaller by a factor of 10 compared to your dataset.
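As a rough sketch of that candidate-table idea (table and column names are hypothetical stand-ins for your own):
-- Extract the working population once...
CREATE TABLE candidate_measurements AS
SELECT *
FROM measurements
WHERE ts >= DATE '2015-06-01'
  AND ts <  DATE '2015-07-01';
-- ...then index it and run all further analysis against this much smaller table.
CREATE INDEX candidate_measurements_ts_idx ON candidate_measurements (ts);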
[edit] I noticed you also mention in a later post that writing is an issue.
Your limitation is going to come down to hardware. Depending on your budget, splashing out on a decent SAN solution may help, although SAN technology is beyond my current level of experience; I am just someone who uses it while other people configure it.
Be sure that the disks where the database lives are not shared with the OS or anything else that may cause contention when reading/writing data.
“That which can be asserted without evidence, can be dismissed without evidence.”
― Christopher Hitchens
|
|
|
|
|
GuyThiebaut wrote: Be sure that the disks where the database lives are not shared with the OS or anything else that may cause contention when reading/writing data.
Cool, it's currently still shared, but I'm getting bigger disks this summer on another machine, so that should fix that.
I'm actually happy that it looks 'normal' to struggle with this; we have some developers here (well, doctors in mathematics and such) who react like "how could that be a problem?", which annoys the hell out of me.
Still taking in advice and letting it simmer a bit, to see if I can get a solution out of it.
Thanks for the advice.
|
|
|
|
|
V. wrote: how could that be a problem?", which annoys the hell out of me
Unfortunately, as you are discovering, when it comes to these sorts of issues experience wins over technical proficiency.
V. wrote: (well, doctors in mathematics and such)
Sometimes one way to handle people's expectations when they have a science background is to explain the science of disk access and data storage; the chances are that, being scientists, they will understand and value a scientific explanation, and that way you can win them over to your side. I say this as someone who works in a research company where many of the people have PhDs; whenever the 'database is slow', I'm the one who hears about it.
I think many people still forget that computers have moving parts; they are machines, not some sort of quantum flux probability engine that returns results at the speed of light.
3 billion records is a huge dataset. Even CERN has intermediary machines that filter out data on the fly as experiments are run (worth checking this out on YouTube), as so much data is generated that they could not possibly store or analyse it all.
Good luck and be sure to run tests before and after so that you can prove that there has been a speed up.
“That which can be asserted without evidence, can be dismissed without evidence.”
― Christopher Hitchens
|
|
|
|
|
I know how to update a single record, but updating multiple records is different.
Single
Dim uTemplate As CRM_JOBS = context.crm_Jobs.Single(Function(m) m.templateName = templateName)
uTemplate.templateXHTML = sEMT.templateXHTML
uTemplate.templateID = pValue
context.SaveChanges()
I thought of this for multiple records, but items doesn't carry over to the loop.
So would I have to project it into something and then save?
Dim jobs =
From items In context.crm_Jobs
Where items.templateName = templateName
For Each items In jobs
jobs.templateXHTML = sEMT.templateXHTML
jobs.templateID = pValue
Next
context.SaveChanges()
[edit]
I have this now
Dim jobs =
From j In context.crm_Jobs
Where j.templateName = templateName
Select j
For Each j As CRM_JOBS In jobs
j.templateXHTML = sEMT.templateXHTML
j.templateID = sEMT.templateID
Next
context.SaveChanges()
Oh that works!
|
|
|
|
|
Hi,
I have below code:
WHILE i < total_rows DO
SET @param_employee_number = (SELECT employee_number FROM earned_leaves LIMIT i,1);
PREPARE query_statement FROM 'CALL sp_populate_leave_summary(?)';
EXECUTE query_statement USING @param_employee_number;
SET i = i + 1;
END WHILE;
I want to save the value returned by the EXECUTE into a variable in order to use it in the next UPDATE statement.
How can I do this please?
Thanks,
Jassim[^]
Technology News @ www.JassimRahma.com
|
|
|
|
|
Given that this is MySQL, you could modify the stored procedure to have an OUT parameter. In that case you could then use two variables in your EXECUTE. So something like:
PREPARE query_statement FROM 'CALL sp_populate_leave_summary(?,?)';
EXECUTE query_statement USING @param_employee_number, @another_parameter;
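For completeness, a rough sketch of the corresponding procedure signature (the parameter names, the type and the body here are assumptions; the point is only the OUT declaration). Whatever the procedure assigns to the OUT parameter ends up in @another_parameter after the EXECUTE:
DELIMITER //
CREATE PROCEDURE sp_populate_leave_summary(
    IN  param_employee_number VARCHAR(20),  -- type is an assumption
    OUT param_result          INT
)
BEGIN
    -- ... existing calculation goes here ...
    SET param_result = 0;  -- placeholder: assign the computed value here
END //
DELIMITER ;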
|
|
|
|
|
I am getting:
Procedure execution failed
1048 - Column 'earned_leave' cannot be null
although sp_populate_leave_summary returns the correct value, as I can see in the result window.
This is the code after adding the OUT parameter:
WHILE i < total_rows DO
SET @param_employee_number = (SELECT employee_number FROM earned_leaves LIMIT i,1);
PREPARE query_statement FROM 'CALL sp_populate_leave_summary(?,?)';
EXECUTE query_statement USING @param_employee_number, @param_eligible_days;
UPDATE earned_leaves SET earned_leave = @param_eligible_days WHERE employee_number = @param_employee_number;
SET i = i + 1;
END WHILE;
my OUT is:
OUT param_eligible_days int
Technology News @ www.JassimRahma.com
|
|
|
|
|
As the error says, the value returned from your procedure into the variable @param_eligible_days is NULL. Because of this, NULL is being written to the table, and that is prohibited because the column is defined as NOT NULL.
So check why sp_populate_leave_summary returns NULL for the second parameter.
|
|
|
|
|
Just noting that the code posted suggests a loop should not be used at all, so the question that derives from it is moot.
SQL (as most databases implement it) works on sets for both queries and updates; the loop is implicit. It is just a matter of structuring the statement, as in the sketch below.
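For example, assuming the calculation inside sp_populate_leave_summary can be expressed as a query (the leave_transactions table below is just a hypothetical stand-in for whatever it does), the whole loop collapses into one statement:
UPDATE earned_leaves el
JOIN (
    SELECT employee_number,
           SUM(days_earned) AS eligible_days
    FROM leave_transactions
    GROUP BY employee_number
) t ON t.employee_number = el.employee_number
SET el.earned_leave = t.eligible_days;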
|
|
|
|
|
I'm trying to convert this to LINQ. I didn't see an OR in LINQ, so I'm figuring you can do multiple Where statements. And on the date, I'm not sure if I should calculate it in code and just submit the date, or if there is a way to use LINQ to do date calculations.
" SELECT COUNT(*) " & _
" FROM CRM_MESSAGES " & _
" WHERE " & _
" MessageStatus='MESSAGE_COMPLETE' " & _
" OR MessageStatus='MESSAGE_PENDING' " & _
" OR MessageStatus='MESSAGE_FORWARD' " & _
" AND ContactDate >=DATEADD(day, -30, getdate()) "
So far I've got this:
Dim messages As IQueryable(Of CRM_MESSAGES)
messages = messages.Where(Function(m) m.MessageStatus = "MESSAGE_COMPLETE")
messages = messages.Where(Function(m) m.MessageStatus = "MESSAGE_PENDING")
messages = messages.Where(Function(m) m.MessageStatus = "MESSAGE_FORWARD")
messages = messages.Where(Function(m) m.ContactDate >= m.ContactDate.Date.AddDays(-30))
|
|
|
|
|
Multiple Where statements are equivalent to an AND operator. For an OR operator, use VB's OrElse operator:
messages = messages.Where(Function(m) m.MessageStatus = "MESSAGE_COMPLETE" OrElse m.MessageStatus = "MESSAGE_PENDING" OrElse m.MessageStatus = "MESSAGE_FORWARD")
Your date comparison won't do anything - you're asking for all messages where the contact date is later than a date 30 days before the contact date, which is all of them. Use Date.Today.AddDays(-30) instead:
messages = messages.Where(Function(m) m.ContactDate >= Date.Today.AddDays(-30))
"These people looked deep within my soul and assigned me a number based on the order in which I joined."
- Homer
|
|
|
|
|
I would never have figured that out on the multiple Where statements.
And the Date bombed, so I tried DateTime.Now.AddDays(-30), but changed to your example before testing it.
I wondered if this would now work without using IQueryable, so I gave it a spin. It didn't work:
LINQ to Entities does not recognize the method 'System.DateTime AddDays(Double)' method, and this method cannot be translated into a store expression.
I'll go back to IQueryable and give it a try.
Dim pValue As Integer = 0
Using context As New CRMContext()
pValue = _
(From item In context.CRM_MESSAGES
Where item.MessageStatus = "MESSAGE_COMPLETE" _
OrElse item.MessageStatus = "MESSAGE_PENDING" _
OrElse item.MessageStatus = "MESSAGE_FORWARD" _
And item.ContactDate >= Date.Today.AddDays(-30)).Count()
End Using
Return pValue
|
|
|
|
|
DbFunctions for EF 6.0, thanks!
Imports System.Data.Entity
Dim pValue As Integer = 0
Using context As New CRMContext()
pValue = _
(From item In context.CRM_MESSAGES
Where item.MessageStatus = "MESSAGE_COMPLETE" _
OrElse item.MessageStatus = "MESSAGE_PENDING" _
OrElse item.MessageStatus = "MESSAGE_FORWARD" _
And item.ContactDate >= DbFunctions.AddDays(DateTime.Now(), -30)).Count()
End Using
Return pValue
|
|
|
|
|
jkirkerx wrote: LINQ to Entities does not recognize the method 'System.DateTime AddDays(Double)' method, and this method cannot be translated into a store expression.
You might need to move the date outside of the filter:
Dim minDate As Date = Date.Today.AddDays(-30)
messages = messages.Where(Function(m) m.ContactDate >= minDate)
Alternatively, use the EntityFramework.SqlServer assembly, and the System.Data.Entity.SqlServer.SqlFunctions class[^]:
messages = messages.Where(Function(m) m.ContactDate >= SqlFunctions.DateAdd("day", -30, SqlFunctions.GetDate()))
"These people looked deep within my soul and assigned me a number based on the order in which I joined."
- Homer
|
|
|
|
|
So I thought I was getting this down now.
In my original code, I used one function to get the name, and another function to get the record results.
So I rewrote it in LINQ, but I'm not quite getting how it works.
So the first query, called contributor, returns a result that looks sort of like SQL and says to expand to see the results.
The second query does the same thing, and the pValue for the record count is non-zero, which triggers an exception.
So I'm missing something on the concept here, and can't figure it out.
I shortened the sample for easier reading.
Public Shared Function load_movieContributor_array( _
ByVal pMovieID As Integer,
ByVal pPageIndex As Integer,
ByVal pPageSize As Integer,
ByRef pResults As IEnumerable(Of MovieIndex)) As Integer
Dim pValue As Integer = 0
Dim context As New MoviesContext()
Dim contributor = _
(From m In context.Movies
Where m.MovieID = pMovieID
Select m.MovieContributor)
Dim movies As IQueryable(Of Movies) = context.Movies
movies = movies.Where(Function(m) m.MovieContributor = contributor)
movies = movies.OrderBy(Function(m) m.MovieName)
'Get a Single Page of Data
If (pPageIndex = 1) Then
movies = movies.Take(pPageSize)
Else
movies = movies.Skip((pPageIndex - 1) * pPageSize).Take(pPageSize)
End If
pResults = _
movies.AsEnumerable().Select(Function(item) New MovieIndex With
{
.MovieID = item.MovieID,
'More columns .....
})
pValue = pResults.Count()
Return pValue
End Function
|
|
|
|
|
Public Shared Function load_movieContributor_array( _
ByVal pMovieID As Integer,
ByVal pPageIndex As Integer,
ByVal pPageSize As Integer,
ByRef pResults As IEnumerable(Of MovieIndex)) As Integer
Dim pValue As Integer = 0
Dim context As New MoviesContext()
Dim cResults = _
(From m In context.Movies
Where m.MovieID = pMovieID
Select m.MovieContributor).SingleOrDefault()
Dim m_contributor As String = cResults
Dim movies As IQueryable(Of Movies) = context.Movies
movies = movies.Where(Function(m) m.MovieContributor = m_contributor)
movies = movies.OrderBy(Function(m) m.MovieName)
'Get a Single Page of Data
If (pPageIndex = 0) Then
movies = movies.Take(pPageSize + 1)
Else
movies = movies.Skip((pPageIndex + 1) * pPageSize).Take(pPageSize)
End If
|
|
|
|