Storing duplicate attachments [UPDATED AGAIN]
|
Logged in as: Guest
|
|
Users viewing this topic:
none
|
|
Login | |
|
Storing duplicate attachments [UPDATED AGAIN] - 12.Mar.2005 6:33:00 PM
|
|
|
chrishaas
Posts: 54
Joined: 20.May2004
Status: offline
|
I exported of the attachments from the SQL server off to disk, ran an SHA-512 off of them and removed the duplicates and found I was able to remove about 25% of the attachments, which in my case was about 250MB. While there's definately an overhead in creating a hash its usually very small and the mail attendant runs passively anyway so the footprint isn't really noticeable. Since the DB is bound to grow really fast a 25% reduction might be nice to have. For my tests I considered an attachment to be the same if it matched another with the same hash, size AND filename, although if using an algorithm like SHA-512 the last two are extraneous.
I don't know how this exactly fits into the SO act but if you use a FIPS-approved hash (http://csrc.nist.gov/CryptoToolkit/tkhash.html) I would think it would be fine.
I know this would be a schema update but I think a 25% gain in space would be worth it.
[Update] I broke the 2GB limit today and realized that the queries I was running choked after 2GB so I've updated my code. The new limit is now 8EB so we're safe for a while. The link two posts down has been updated.
[Update 2] See my post below for another update: http://forums.gfi.com/cgi-bin/ultimatebb.cgi?ubb=get_topic;f=13;t=000092#000023 [ July 04, 2005, 06:48 PM: Message edited by: Chris Haas ]
|
|
|
|
RE: Storing duplicate attachments [UPDATED AGAIN] - 15.Mar.2005 4:12:00 AM
|
|
|
cameron
Posts: 6
Joined: 20.Dec.2004
Status: offline
|
Yeah I brought this up a while ago, but I was just fobbed off, and they didnt take much notice of it. So maybe you might have more luck than me!
|
|
|
|
RE: Storing duplicate attachments [UPDATED AGAIN] - 15.Mar.2005 10:37:00 AM
|
|
|
chrishaas
Posts: 54
Joined: 20.May2004
Status: offline
|
Did you run any tests to find out how much extra space was used? I wrote a quick utility that calculates everything that you can use if you want. I recommend running it on a backup of the server although I've successfully run it on a production one. http://webteam.ovationmarketing.com/Chris/GFI.zip
|
|
|
|
RE: Storing duplicate attachments [UPDATED AGAIN] - 15.Mar.2005 7:22:00 PM
|
|
|
cameron
Posts: 6
Joined: 20.Dec.2004
Status: offline
|
I did initially after running GFI for a week, but I'll let the nightly backup run tonight and then give your little utility a whirl..looks cool, will post the results
|
|
|
|
RE: Storing duplicate attachments [UPDATED AGAIN] - 21.Mar.2005 2:14:00 PM
|
|
|
chrishaas
Posts: 54
Joined: 20.May2004
Status: offline
|
Any luck?
|
|
|
|
RE: Storing duplicate attachments [UPDATED AGAIN] - 29.Mar.2005 9:43:00 AM
|
|
|
chrishaas
Posts: 54
Joined: 20.May2004
Status: offline
|
Yeah, because that script relies on GFI to create the database and then uses "SELECT *" instead of the column names, one database has more columns than the other. you can fix it by changing the line:
code:
SELECT * from ['+@src+']..arc_att where [id]=@mid
to
code:
SELECT [id], [filename], [attdata], [attsize], [attid], [compressed] from ['+@src+']..arc_att where [id]=@mid
that will drop the extra column that my program adds from the transfer.
alternately, if you want to completely remove the column you can run:
code:
ALTER TABLE [arc_att] DROP COLUMN [cjh_sha512_hash]
assuming you ran it, how were the results that you got back? [ March 29, 2005, 03:43 PM: Message edited by: Chris Haas ]
|
|
|
|
RE: Storing duplicate attachments [UPDATED AGAIN] - 29.Mar.2005 10:40:00 AM
|
|
|
PACEGMBH
Posts: 62
Joined: 23.Mar.2005
From: DE - Berlin
Status: offline
|
Thanks for the quick reply.
Rgds, PACEGMBH
|
|
|
|
RE: Storing duplicate attachments [UPDATED AGAIN] - 2.Apr.2005 4:25:00 PM
|
|
|
chrishaas
Posts: 54
Joined: 20.May2004
Status: offline
|
Thanks John. Can you tell me what the ammounts to in filesize?
|
|
|
|
RE: Storing duplicate attachments [UPDATED AGAIN] - 4.Apr.2005 5:40:00 AM
|
|
|
PACEGMBH
Posts: 62
Joined: 23.Mar.2005
From: DE - Berlin
Status: offline
|
Unfortunately we startet again with a brand new dB last friday. Since the dB isn't very big by now I'll tell you some numbers next friday.
Regards, John
|
|
|
|
RE: Storing duplicate attachments [UPDATED AGAIN] - 15.Apr.2005 4:38:00 AM
|
|
|
PACEGMBH
Posts: 62
Joined: 23.Mar.2005
From: DE - Berlin
Status: offline
|
So, today after 2 weeks of archiving. I've run the SHAPOC tool again.
Showing following output:
Current number of files : 2.674 Optimized number of files : 2.304 Difference : 370 (13,837%)
Current size of files : 429.661.051 Optimized size of files : 361.123.358 Difference : 68.537.693 (15,952%)
Not that worse but if the dB could be smaller I'd appreciate this.
Rgds,
John
|
|
|
|
RE: Storing duplicate attachments [UPDATED AGAIN] - 18.Apr.2005 3:52:00 PM
|
|
|
chrishaas
Posts: 54
Joined: 20.May2004
Status: offline
|
Thanks. Here's where I'm at now:
Current number of files : 20,537 Optimized number of files : 16,218 Difference : 4,319 (21.030%)
Current size of files : 2,683,522,928 Optimized size of files : 1,869,030,373 Difference : 814,492,555 (30.352%)
|
|
|
|
RE: Storing duplicate attachments [UPDATED AGAIN] - 26.Apr.2005 2:34:00 AM
|
|
|
PACEGMBH
Posts: 62
Joined: 23.Mar.2005
From: DE - Berlin
Status: offline
|
Hi Chris,
after attending a SQL Server course last week, I changed the way our SQL Server runs. Now it runs with a domain user account.
Unfortunately I'm no longer able to use the SHAPOC tool. It wont accept my logon credentials.
Have you got any hints how to solve that problem?
Regards,
John
|
|
|
|
RE: Storing duplicate attachments [UPDATED AGAIN] - 29.Apr.2005 1:59:00 AM
|
|
|
PACEGMBH
Posts: 62
Joined: 23.Mar.2005
From: DE - Berlin
Status: offline
|
Fortunately I moved the Mailarchiver database to a different server yesterday. Now I'm able to use your tool again:
After four weeks archiving:
Microsoft Windows [Version 5.2.3790] (C) Copyright 1985-2003 Microsoft Corp.
K:\>d:
D:\>shapoc The following program scans the GFI attachment database looking for duplicates. In order to do so it adds a column to the "arc_att" table called "cjh_sha512_hash". It then scans the table and computes the hash for each attachment and stores it in this field. It does _not_ change the attachment in any way. Please read the README file included with the program to better understand this program.
Please backup your database before continuing.
During the question phase you can quit at any time by entering q and pressing <enter>
Please enter the database name/IP:coruscant Should I use your current NT credentials? [Y/n/q]:y
Successfully connected to SQL Server
Please enter the database name that contains your GFI Mail Archiver database. Enter @list to list databases, @quit to quit:mailarchiv
Database appears to be a valid GFI Mail Archiver database The next step will alter the physical schema of the database. Do you wish for me to continue? [Y/n/q]:#y The next step will alter the physical schema of the database. Do you wish for me to continue? [Y/n/q]:y
Column already exists, continuing.
Should I begin creating SHA-512 hashed? This process might take a while. [Y/n/q] y
There are 5478 attachments totaling 1.139.453.202 bytes.
Current number of files : 5.478 Optimized number of files : 4.748 Difference : 730 (13,326%)
Current size of files : 1.139.453.202 Optimized size of files : 928.902.553 Difference : 210.550.649 (18,478%)
Regards, John
|
|
|
|
New Messages |
No New Messages |
Hot Topic w/ New Messages |
Hot Topic w/o New Messages |
Locked w/ New Messages |
Locked w/o New Messages |
|
Post New Thread
Reply to Message
Post New Poll
Submit Vote
Delete My Own Post
Delete My Own Thread
Rate Posts |
|
|