|
Cool.
|
|
|
|
|
You will usually see a significantly lower transfer rate for smaller file sizes.
Look at the graph at the bottom of my article about my NAS box; you will see a much lower transfer rate for small block sizes compared to larger block sizes.
QNAP NAS Memory Upgrade, Hardware Change and Performance Benefits[^]
Also in the article, I have included some benchmark software I use, and an Excel spreadsheet template I use for tracking benchmarks between mods, etc.
|
|
|
|
|
(Hope this is the right forum, then.)
Hi all,
I'm about to begin a small project in which I must be able to store and look up as many as 20 million files - in the best possible way. Needless to say, fast.
For this I have been around -
http://en.wikipedia.org/wiki/NTFS#Limitations
http://www.ntfs.com/ntfs_vs_fat.htm
And now my question: dealing with a production load of around 60,000 files (pictures) per day, each around 300 KB in size, what ratio of files to directories would give the best search time? Obviously I will not put all the files in one directory, but in a number of directories. So what would be the best economy for such a thing?
Seems to be hard to find information about on the web.
Thanx' in advance,
Kind regards,
Michael Pauli
|
|
|
|
|
Hi,
1.
I tend to limit the number of files per folder to 50 or 100. In my experience it is not very relevant if you never need to browse the folder with, say, Windows Explorer: when your app knows which file to access, the folder size does not matter. If you can group the files logically (say by topic), then by all means do so. OTOH if you have to open the folder in Explorer, especially on a remote computer, things may slow down considerably when the folder holds hundreds of files/folders or more. If so, use a two-stage or three-stage organization; with a maximum of N files per folder, that can hold N*N or N*N*N files.
2.
Search what? File content? File names? Partial file names? If file names, then again, organize a multi-level folder hierarchy based on what matters most to you (it could be the first and second characters of the file names).
3.
Whatever it is you really need, just give it a try. In a matter of minutes a test app could create and store a huge number of files (real or dummy), and you could experiment with the result; see the sketch below.
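For instance, a minimal Python sketch of points 1-3 (the root folder and the hashing choice are my own assumptions, just to illustrate; hashing spreads the files evenly, unlike raw first characters):

import hashlib
import os

ROOT = "D:/pics"  # hypothetical root folder

def path_for(filename):
    # Hash the name, then use the first two pairs of hex digits as
    # folder levels: 256 * 256 = 65,536 leaf folders, i.e. roughly
    # 300 files per folder for 20 million files.
    h = hashlib.md5(filename.encode("utf-8")).hexdigest()
    folder = os.path.join(ROOT, h[:2], h[2:4])
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, filename)

# Point 3 in practice: create a pile of dummy files and experiment.
for n in range(10000):
    name = "img%08d.jpg" % n
    with open(path_for(name), "wb") as f:
        f.write(b"\0" * 300 * 1024)  # ~300 KB dummy payload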
PS: I'm sure all this is in the wrong forum, it isn't hardware related, is it?
|
|
|
|
|
Seriously, use a database instead. You're losing very little storage space and gaining so much on the lookup. At least if it's properly indexed.
|
|
|
|
|
Hi Jörgen,
I totally agree with your comment, but my customer wants to use a file system and not an Oracle DB, etc. I really don't understand why, but I'm told it is something about maintenance and backup.
Kind regards,
Michael Pauli
|
|
|
|
|
Yeah, that's utter bullshit.
Your customer is going to find that that method will be non-performant and limited as well as very easy to screw up while doing "maintenance".
The more files and directories you shove into the directory structure, the slower a single search is going to get. Indexing won't help much as the indexes will be limited to the properties of the files themselves as well as the metadata stored in the image files.
The more files and directories you add, the more the NTFS data structures grow, eventually taking up gigabytes of space and slowing your machine's boot time; and if something should happen to those tables, God help you when performing a CHKDSK on the volume. Bring a cot to sleep on.
The backup argument is also garbage, as it's just as easy to back up a database as it is to back up the massive pile of debris you're about to litter the drive with.
|
|
|
|
|
Very nice and clear summary.
|
|
|
|
|
Dave Kreskowiak wrote: The more files and directories you shove into the directory structure, the slower a single search is going to get. Indexing won't help much as the indexes will be limited to the properties of the files themselves as well as the metadata stored in the image files.
Not sure I understand that.
I am rather certain that both MS SQL Server and Oracle provide a file-based blob storage mechanism. And of course using a URL string for a blob entry is an option for any database. There are tradeoffs as to whether one wants to keep it in the database or the file system.
And it isn't that hard to implement at least some simplistic indexing scheme if one doesn't want to use a database. That requires using another file; it doesn't require searching the files themselves. And if one were using a database, then one would still have to export the metadata from the files. If one didn't, I wouldn't be surprised if attempting to extract metadata from image blobs was slower with a database.
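As a sketch of what I mean (the key names and paths are made up, and a real system would want something sturdier than JSON, but the principle holds):

import json
import os

INDEX_PATH = "D:/pics/index.json"  # hypothetical side-index file

def load_index():
    if os.path.exists(INDEX_PATH):
        with open(INDEX_PATH) as f:
            return json.load(f)
    return {}

def add_entry(index, key, picture_path):
    # Map a lookup key (a date, tag, customer id, ...) to matching files.
    index.setdefault(key, []).append(picture_path)

def save_index(index):
    with open(INDEX_PATH, "w") as f:
        json.dump(index, f)

# Record each picture under its keys once, at store time; afterwards a
# lookup is a dictionary hit, not a scan of 20 million files.
idx = load_index()
add_entry(idx, "2011-09-14", "D:/pics/a7/3f/img00000042.jpg")
save_index(idx)
print(idx.get("2011-09-14", []))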
Dave Kreskowiak wrote: The more files and directories you add, the more the NTFS data structures grow, eventually taking up gigabytes of space and slowing your machine's boot time
What does the storage requirement have to do with anything? If you store something in a database it takes space too.
I have never heard anyone make that claim about any OS slowing down. Could you provide a reference?
|
|
|
|
|
jschell wrote: Not sure I understand that.
In order to search 20,000,000 files and have a request return something in your lifetime, you'd better have the Indexing service turned on, and your app had better be using it.
Check the OP. He's specifically avoiding using a database because of stupid customer requirements.
jschell wrote: What does the storage requirement have to do with anything?
The size of the NTFS tables on disk grows and grows with the number of files and folders you stick in the volume. Directory entries take up space on the disk.
Not so much if you put everything into a database since the database is only a few files.
jschell wrote: I have never heard anyone make that claim about any OS slowing down. Could you provide a reference?
Don't have to. Think about it. The NTFS tables take up memory. The bigger you make those tables, the more memory is going to be eaten up and the less is available for apps. Of course, what effect this has depends on how much memory is in the machine.
I meant to say that the server will take longer and longer to boot, not necessarily slow down the app once everything is loaded and running.
You want documentation? Try it yourself. Load up your C: drive with 20,000,000 files in a few thousand folders, reboot your machine and watch what happens. To take it a bit further, try scheduling a CHKDSK and reboot. Don't forget to have a pot of coffee standing by.
|
|
|
|
|
Dave Kreskowiak wrote: In order to search 20,000,000 files and have a request return something in your lifetime, you'd better have the Indexing service turned on, and your app had better be using it.
In order to search the image data of 20 million blobs in a database, it is going to take just as long and probably longer.
The only way to avoid that in the database is to extract the metadata from the images and store it somewhere else in the database.
And again, one can do exactly the same thing with a file-based system.
Dave Kreskowiak wrote: The size of the NTFS tables on disk grows and grows with the number of files and folders you stick in the volume. Directory entries take up space on the disk.
The size of the database on the disk grows with the number of blobs you stick in it.
So how exactly is that different?
Dave Kreskowiak wrote: Don't have to. Think about it. The NTFS tables take up memory. The bigger you make those tables, the more memory is going to be eaten up and the less is available for apps. Of course, what effect this has depends on how much memory is in the machine.
That isn't how any modern file system works.
It doesn't load the entire file system into memory. As a matter of fact, the database is going to load more into memory than the file system will. Quite a bit more, unless you constrain it.
Not that it would matter anyway, since it would be using virtual memory.
Dave Kreskowiak wrote: I meant to say that the server will take longer and longer to boot
That clarifies it for me - I don't believe that. Please provide a reference. Provide a reference that refers to booting the machine.
(Since I was interested, I also determined that I have over 500,000 files on my personal development computer. If there were in fact some impact, then I would certainly expect a server-class machine with a server-class file system to be able to handle more files than a personal dev box.)
|
|
|
|
|
Hi Dave!
Thank you for your opinion. I must say I tend to go your way here, but to avoid any problems of a more political nature I'll go for the file system solution. In my career I've never done a thing like that, and I find it hard to write, even though it's simplistic by nature.
To begin with we go for 500 directories, each holding 500 subdirectories, each again holding 500 subdirectories. That is 500³ = 125,000,000 slots. I'm getting a server for this, so it's not on my local dev PC.
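Something like this rough Python sketch is what I have in mind (the root path and naming scheme are placeholders; the real thing will differ):

import os

ROOT = "E:/store"  # hypothetical root
FANOUT = 500       # 500 x 500 x 500 directories, as described above

def path_for(file_id):
    # Decompose a sequential id into three directory levels of 500 each,
    # giving 125,000,000 distinct leaf folders.
    a, rest = divmod(file_id, FANOUT * FANOUT)
    b, c = divmod(rest, FANOUT)
    folder = os.path.join(ROOT, str(a), str(b), str(c))
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, "img%09d.jpg" % file_id)

print(path_for(0))          # E:/store/0/0/0/img000000000.jpg
print(path_for(123456789))  # E:/store/493/413/289/img123456789.jpg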
My feeling is that we would be better off having an Oracle DB or the like for it. But the decision is made.
Thanx' again.
Kind regards,
Michael Pauli
|
|
|
|
|
Maintenance and backup are among the best reasons to use a database.
Tell them to educate their staff.
Dave's summary is spot on, in my opinion.
|
|
|
|
|
Yeah, sure - I agree, but some technicians here would like to have this file-based and not put it in a database, for some more or less obscure reasons. So they get what they want. I have less than a week left on this assignment ... if you get my point.
Kind regards,
Michael Pauli
|
|
|
|
|
Jörgen Andersson wrote: You're losing very little storage space and gaining so much on the lookup.
+5; there's probably little worry about fragmentation, as the pictures do not change, and it'd be the fastest solution to retrieve a blob.
Bastard Programmer from Hell
|
|
|
|
|
In terms of general design....
The following site is nice for articles on exactly what the name suggests.
http://highscalability.com/[^]
Here is one that you might find more specifically relevant. There are others there about Flickr as well.
http://highscalability.com/flickr-architecture[^]
Michael Pauli wrote: in the best possible way. Needless to say, fast.
((8-hour business day) * 3600 seconds/hour) / 60,000 files = 0.48 seconds per request, i.e. roughly two requests per second.
That by itself doesn't require much of a "fast" lookup.
And exactly what the lookup consists of is probably more relevant. A file architecture probably matters more for accessing a file than for looking it up.
And once you have it, you must still serve it back to the caller, which is going to be a non-trivial cost.
If one uses a direct URL mapping, then there are probably other optimization strategies, such as some sort of grouping in terms of where pics sit on the hard drive, rather than attempting to optimize directory size, that would provide a more measurable impact. Although I wonder if it would be significant. I would also expect such strategies to be affected (if measurable) by the actual hard drive chosen.
|
|
|
|
|
Maybe a document repository tool would be a good solution. Here is one that I use extensively, although just the free version so far. You would need the full version to allow more concurrent users and an unlimited document count.
http://www.m-files.com/eng/home.asp[^]
The repository offers classification of files and fast searching, and it pretty much considers the "physical" location irrelevant; instead, all documents are simply in the "bag".
Also, a big plus: it doesn't use Windows network mapping (or whatever its formal name is); it works over TCP instead (IMGIC).
|
|
|
|
|
I am not much of a hardware guy, and I am not sure where to ask this question, so if I am in the wrong area please be kind and redirect me.
Work issued me a laptop with a docking station. Good: I can get two monitors (the laptop, which I have a hard time reading, and a second monitor which is much easier on my eyes). Can I get a USB or other adapter if I want to add a second external monitor to this setup, thereby having two monitors plus the laptop display?
My other option is to request a desktop; in that case, any guidance on what to make sure it has in order to support three monitors?
no-e
|
|
|
|
|
Dell has a docking station that supports the use of two external monitors.
Otherwise you already answered your own question. There's always the USB option. No gaming or movies on that one though.
|
|
|
|
|
Some desktops have three outputs built in (VGA, DVI, and HDMI)... but whenever possible use a digital connection (DVI or HDMI) instead of analog (VGA); the difference in quality is visible when the screens sit side by side. For the same reason, you would prefer all screens to be of similar quality.
I have two screens (running at Full HD resolution) and I rarely use the second one, only occasionally when I want to compare things.
I would not recommend a USB monitor, as the performance is far behind. I once tried a small picture frame that could be used as a monitor, and I returned it because it performed really poorly.
Full HD resolution is highly recommended if you use Visual Studio.
Philippe Mori
|
|
|
|
|
(original post in the Lounge[^])
In summary, I'm building a new PC from parts (from Tom's Hardware, $1000 build) and ran into problems when plugging in the graphics cards (two of them).
I've replaced the power supply with a much more powerful one, 1000 W, and still cannot get the system to boot; I can't even see the BIOS/EFI screen, not even the motherboard splash screen. All fans work, so there is power to all the components, including the graphics cards.
My system is up and running with my old graphics card, so I know the hardware is functional and there are no bad connections or bad components other than the graphics cards.
I got two of the same graphics card (Gigabyte AMD HD 6850, 1 GB), and both are not working; I assume that I don't have two bad cards...
Current system:
Motherboard: MSI P67A-G43
CPU: Intel i5 2500K
Power Supply: Antec TruePower Quattro 1000W
Memory: G.Skill Ripjaws 4 GB
Graphics Card: 2 x Gigabyte Radeon HD 6850 1 GB GDDR5
HD: Western Digital Caviar Black 750 GB
DVD: ASUS SATA DVD-RW
The whole system works perfectly if I use my old NVIDIA 8800 GTS card.
Tomorrow, I will try to find a PC at work that supports PCIe cards, so I can plug the cards in and at least find out whether they work or not.
I will also contact MSI and Gigabyte (or AMD) to see if there is something particular I need to do on the hardware side of things.
Maybe there is something I have to set or enable in the BIOS/EFI to allow the cards to work.
Any more ideas ?
Thanks.
Max.
Watched code never compiles.
|
|
|
|
|
Maximilien wrote: all fans work, so there is power to all the components, including the graphics cards.
Maybe. As I pointed out in the Lounge, some PCIe graphics cards need the extra "top" power connected. It may be a matter of split power rails rather than paralleled ones. (I haven't got one to play with, so I can't confirm this.)
Asking Gigabyte sounds like a really good idea.
Cheers,
Peter
Software rusts. Simon Stephenson, ca 1994.
|
|
|
|
|
Peter_in_2780 wrote: some PCIe graphics cards need the extra "top" power connected. It may be a matter of split power rails rather than paralleled ones. (I haven't got one to play with, so I can't confirm this.)
Well, there is only one power plug on each card; I plug in the dedicated PCIe plug from the power supply, each one on a different cable.
I don't know what a split power rail is... translation or image?
Watched code never compiles.
|
|
|
|
|
According to the ATX spec, you can only put a relatively limited number of amps on a single part of the 12 V power-generation hardware inside the PSU (called rails). Early high-wattage, 12V-centric PSUs followed this restriction and split 12 V into two or more rails, with the potential result that if you tried to pull too much 12 V from a certain subset of the plugs, it would fail because you maxed out one rail even though you were well short of the total limit. The fun part is that the rail structure was rarely (if ever) documented; so unless your particular PSU was reviewed and dismembered by an EE on a site like Jonny Guru[^], you'd have no way of knowing what the rails were except by trial and error. Most (all?) new PSUs simply disregard the rail amperage restrictions and put all 40, 60, 80, etc. amps on a single rail to make it easier for users. (I don't know if they put any sort of current-limiting hardware at a per-cable level; avoiding yanking currents high enough to melt wires down a single cable was part of the reason behind splitting rails.)
You can fiddle around with different PCIe plugs on different cables, but with a new PSU it's unlikely to be an issue, especially since even split-rail designs should be able to run several PCIe plugs per rail, and your 6850s only use a single plug each.
Did you ever see history portrayed as an old man with a wise brow and pulseless heart, waging all things in the balance of reason?
Is not rather the genius of history like an eternal, imploring maiden, full of fire, with a burning heart and flaming soul, humanly warm and humanly beautiful?
--Zachris Topelius
|
|
|
|
|