Forums  Register  Login  My Profile  Inbox  Address Book  My Subscription  My Forums 

Member List  Search  FAQ  Ticket List  Log Out

 

Storing duplicate attachments [UPDATED AGAIN]

 
Logged in as: Guest
Users viewing this topic: none
  Printable Version
All Forums >> [Archiving & Fax] >> GFI MailArchiver >> Storing duplicate attachments [UPDATED AGAIN] Page: [1] 2   next >   >>
Login
Message << Older Topic   Newer Topic >>
Storing duplicate attachments [UPDATED AGAIN] - 12.Mar.2005 6:33:00 PM   
chrishaas

 

Posts: 54
Joined: 20.May2004
Status: offline
I exported of the attachments from the SQL server off to disk, ran an SHA-512 off of them and removed the duplicates and found I was able to remove about 25% of the attachments, which in my case was about 250MB. While there's definately an overhead in creating a hash its usually very small and the mail attendant runs passively anyway so the footprint isn't really noticeable. Since the DB is bound to grow really fast a 25% reduction might be nice to have. For my tests I considered an attachment to be the same if it matched another with the same hash, size AND filename, although if using an algorithm like SHA-512 the last two are extraneous.

I don't know how this exactly fits into the SO act but if you use a FIPS-approved hash (http://csrc.nist.gov/CryptoToolkit/tkhash.html) I would think it would be fine.

I know this would be a schema update but I think a 25% gain in space would be worth it.

[Update]
I broke the 2GB limit today and realized that the queries I was running choked after 2GB so I've updated my code. The new limit is now 8EB so we're safe for a while. The link two posts down has been updated.

[Update 2]
See my post below for another update:
http://forums.gfi.com/cgi-bin/ultimatebb.cgi?ubb=get_topic;f=13;t=000092#000023

[ July 04, 2005, 06:48 PM: Message edited by: Chris Haas ]
Post #: 1
RE: Storing duplicate attachments [UPDATED AGAIN] - 15.Mar.2005 4:12:00 AM   
cameron

 

Posts: 6
Joined: 20.Dec.2004
Status: offline
Yeah I brought this up a while ago, but I was just fobbed off, and they didnt take much notice of it. So maybe you might have more luck than me!

(in reply to chrishaas)
Post #: 2
RE: Storing duplicate attachments [UPDATED AGAIN] - 15.Mar.2005 10:37:00 AM   
chrishaas

 

Posts: 54
Joined: 20.May2004
Status: offline
Did you run any tests to find out how much extra space was used? I wrote a quick utility that calculates everything that you can use if you want. I recommend running it on a backup of the server although I've successfully run it on a production one.
http://webteam.ovationmarketing.com/Chris/GFI.zip

(in reply to chrishaas)
Post #: 3
RE: Storing duplicate attachments [UPDATED AGAIN] - 15.Mar.2005 7:22:00 PM   
cameron

 

Posts: 6
Joined: 20.Dec.2004
Status: offline
I did initially after running GFI for a week, but I'll let the nightly backup run tonight and then give your little utility a whirl..looks cool, will post the results

(in reply to chrishaas)
Post #: 4
RE: Storing duplicate attachments [UPDATED AGAIN] - 21.Mar.2005 2:14:00 PM   
chrishaas

 

Posts: 54
Joined: 20.May2004
Status: offline
Any luck?

(in reply to chrishaas)
Post #: 5
RE: Storing duplicate attachments [UPDATED AGAIN] - 24.Mar.2005 6:53:00 AM   
PACEGMBH

 

Posts: 62
Joined: 23.Mar.2005
From: DE - Berlin
Status: offline
After using this Tool the script discussed here: http://forums.gfi.com/cgi-bin/ultimatebb.cgi?ubb=get_topic;f=13;t=000058
won't work anymore.

(in reply to chrishaas)
Post #: 6
RE: Storing duplicate attachments [UPDATED AGAIN] - 29.Mar.2005 9:43:00 AM   
chrishaas

 

Posts: 54
Joined: 20.May2004
Status: offline
Yeah, because that script relies on GFI to create the database and then uses "SELECT *" instead of the column names, one database has more columns than the other. you can fix it by changing the line:
code:
SELECT * from ['+@src+']..arc_att where [id]=@mid

to
code:
SELECT [id], [filename], [attdata], [attsize], [attid], [compressed]  from ['+@src+']..arc_att where [id]=@mid

that will drop the extra column that my program adds from the transfer.

alternately, if you want to completely remove the column you can run:
code:
ALTER TABLE [arc_att]
DROP COLUMN [cjh_sha512_hash]

assuming you ran it, how were the results that you got back?

[ March 29, 2005, 03:43 PM: Message edited by: Chris Haas ]

(in reply to chrishaas)
Post #: 7
RE: Storing duplicate attachments [UPDATED AGAIN] - 29.Mar.2005 10:40:00 AM   
PACEGMBH

 

Posts: 62
Joined: 23.Mar.2005
From: DE - Berlin
Status: offline
Thanks for the quick reply.

Rgds, PACEGMBH

(in reply to chrishaas)
Post #: 8
RE: Storing duplicate attachments [UPDATED AGAIN] - 30.Mar.2005 10:52:00 AM   
PACEGMBH

 

Posts: 62
Joined: 23.Mar.2005
From: DE - Berlin
Status: offline
Hi Chris,

thanks for your support. Think I should learn SQL [Wink]
Now the script runs properly again.

I ran the tool on two databases.
The first one was my test db which could be 7,73 % smaller if the attachments wouldn't be stored twice or even more.
The second dB is very small because it is brand new.
Here even 9,52 % could be saved.

Best regards,

John

(in reply to chrishaas)
Post #: 9
RE: Storing duplicate attachments [UPDATED AGAIN] - 2.Apr.2005 4:25:00 PM   
chrishaas

 

Posts: 54
Joined: 20.May2004
Status: offline
Thanks John. Can you tell me what the ammounts to in filesize?

(in reply to chrishaas)
Post #: 10
RE: Storing duplicate attachments [UPDATED AGAIN] - 4.Apr.2005 5:40:00 AM   
PACEGMBH

 

Posts: 62
Joined: 23.Mar.2005
From: DE - Berlin
Status: offline
Unfortunately we startet again with a brand new dB last friday. Since the dB isn't very big by now I'll tell you some numbers next friday.

Regards,
John

(in reply to chrishaas)
Post #: 11
RE: Storing duplicate attachments [UPDATED AGAIN] - 15.Apr.2005 4:38:00 AM   
PACEGMBH

 

Posts: 62
Joined: 23.Mar.2005
From: DE - Berlin
Status: offline
So, today after 2 weeks of archiving.
I've run the SHAPOC tool again.

Showing following output:

Current number of files : 2.674
Optimized number of files : 2.304
Difference : 370 (13,837%)

Current size of files : 429.661.051
Optimized size of files : 361.123.358
Difference : 68.537.693 (15,952%)

Not that worse but if the dB could be smaller I'd appreciate this.

Rgds,

John

(in reply to chrishaas)
Post #: 12
RE: Storing duplicate attachments [UPDATED AGAIN] - 18.Apr.2005 3:52:00 PM   
chrishaas

 

Posts: 54
Joined: 20.May2004
Status: offline
Thanks. Here's where I'm at now:

Current number of files : 20,537
Optimized number of files : 16,218
Difference : 4,319 (21.030%)

Current size of files : 2,683,522,928
Optimized size of files : 1,869,030,373
Difference : 814,492,555 (30.352%)

(in reply to chrishaas)
Post #: 13
RE: Storing duplicate attachments [UPDATED AGAIN] - 26.Apr.2005 2:34:00 AM   
PACEGMBH

 

Posts: 62
Joined: 23.Mar.2005
From: DE - Berlin
Status: offline
Hi Chris,

after attending a SQL Server course last week,
I changed the way our SQL Server runs.
Now it runs with a domain user account.

Unfortunately I'm no longer able to use the SHAPOC tool. It wont accept my logon credentials.

Have you got any hints how to solve that problem?

Regards,

John

(in reply to chrishaas)
Post #: 14
RE: Storing duplicate attachments [UPDATED AGAIN] - 29.Apr.2005 1:59:00 AM   
PACEGMBH

 

Posts: 62
Joined: 23.Mar.2005
From: DE - Berlin
Status: offline
Fortunately I moved the Mailarchiver database to a different server yesterday. Now I'm able to use your tool again:

After four weeks archiving:

Microsoft Windows [Version 5.2.3790]
(C) Copyright 1985-2003 Microsoft Corp.

K:\>d:

D:\>shapoc
The following program scans the GFI attachment database
looking for duplicates. In order to do so it adds a
column to the "arc_att" table called "cjh_sha512_hash".
It then scans the table and computes the hash for each
attachment and stores it in this field. It does _not_ change
the attachment in any way. Please read the README file
included with the program to better understand this
program.

Please backup your database before continuing.

During the question phase you can quit at any time by
entering q and pressing <enter>

Please enter the database name/IP:coruscant
Should I use your current NT credentials? [Y/n/q]:y

Successfully connected to SQL Server

Please enter the database name that contains
your GFI Mail Archiver database.
Enter @list to list databases, @quit to quit:mailarchiv

Database appears to be a valid
GFI Mail Archiver database
The next step will alter the physical schema
of the database. Do you wish for me
to continue? [Y/n/q]:#y
The next step will alter the physical schema
of the database. Do you wish for me
to continue? [Y/n/q]:y

Column already exists, continuing.

Should I begin creating SHA-512 hashed?
This process might take a while. [Y/n/q]
y

There are 5478 attachments totaling 1.139.453.202 bytes.

Current number of files : 5.478
Optimized number of files : 4.748
Difference : 730 (13,326%)

Current size of files : 1.139.453.202
Optimized size of files : 928.902.553
Difference : 210.550.649 (18,478%)

Regards, John

(in reply to chrishaas)
Post #: 15
Page:   [1] 2   next >   >>
All Forums >> [Archiving & Fax] >> GFI MailArchiver >> Storing duplicate attachments [UPDATED AGAIN] Page: [1] 2   next >   >>
Jump to:





New Messages No New Messages
Hot Topic w/ New Messages Hot Topic w/o New Messages
Locked w/ New Messages Locked w/o New Messages
 Post New Thread
 Reply to Message
 Post New Poll
 Submit Vote
 Delete My Own Post
 Delete My Own Thread
 Rate Posts