MOM Formats and Collections Scope

Master Objects Migration Project Plan Home

Scope

Location                  In Scope?
Dark Archive
  /archive/Archived       No
  /archive/Committees     No
  /archive/Dept/ARV       Yes
  /archive/Dept/ATH       Yes
  /archive/Dept/CGA       Yes
  /archive/Dept/CSS       No
  /archive/Dept/DI        Yes
  /archive/Dept/HIL       Yes
  /archive/Dept/KB        Yes, as it pertains to MO determinations in other collection folders
  /archive/Dept/MUS       Yes
  /archive/Dept/RAR       Yes
  /archive/Dept/SRI       No, this is just a mapping to /archive/Dept/KB
  /archive/Dept/TRI       Yes
  /archive/Dept/WIT       No
  /archive/Fedora         No
  /archive/FedoraMORdata  No
  /archive/lost+found     No
J-Drive                   Yes, as it pertains to MO determinations in DA folders
External Media            Yes

This analysis accounts for files noted as "Yes" in the above table that are in the Dark Archive. It does not currently account for those in the J-Drive or External Media.

Detailed data can be found in: fileTypeCount-20151105-rev-dwn-20151106.xlsx

Collections

For this analysis of materials in the Dark Archive, a report was run on 11/05/2015. The first-level sub-folder within a department folder is considered a "collection." While this may not always be true, it at least gives us a starting point for quantifying the scope and its impact.
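As a rough illustration of the "first-level sub-folder is a collection" rule, the sketch below groups a listing of file paths by department and collects the first folder name under each department. The function name, the DEPT_ROOT constant and the sample paths are hypothetical, and the sketch assumes the input lists files (not folders) under /archive/Dept. It is not the script used for the 11/05/2015 report.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Assumed root of the department folders, taken from the Scope table.
DEPT_ROOT = PurePosixPath("/archive/Dept")


def collections_by_department(paths):
    """Map each department to the set of its first-level sub-folders."""
    result = defaultdict(set)
    for raw in paths:
        try:
            relative = PurePosixPath(raw).relative_to(DEPT_ROOT)
        except ValueError:
            continue  # path is not under /archive/Dept
        parts = relative.parts
        # parts[0] is the department and parts[1] the candidate collection.
        # Requiring a third part ensures parts[1] is a folder, not a file
        # sitting directly in the department folder.
        if len(parts) >= 3:
            result[parts[0]].add(parts[1])
    return dict(result)


# Hypothetical sample listing
sample = [
    "/archive/Dept/ARV/coll_a/box1/page001.tif",
    "/archive/Dept/ARV/coll_b/report.pdf",
    "/archive/Dept/KB/readme.txt",  # file directly in the department folder
]
for dept, names in sorted(collections_by_department(sample).items()):
    print(dept, len(names))  # prints: ARV 2
```

Files that sit directly in a department folder are ignored here, which is one of the cases where the "collection" rule may not hold.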

The interesting aspect here is that three departments (ARV, DI and KB) hold 38% of the collections but account for 89% of the files, while two departments (RAR and TRI) hold 57% of the collections but account for only 7% of the files:

Department  # of collections  % of collections  % of files
ARV                       11                5%         24%
BPA                        1                0%          1%
CGA                        1                0%          0%
DI                        22               10%         18%
HIL                        4                2%          0%
KB                        48               23%         47%
MUS                        3                1%          3%
RAR                       52               25%          4%
TRI                       68               32%          3%
Totals                   210              100%        100%
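The concentration of files in a few departments can be rechecked from the table itself. The sketch below is only an arithmetic check: the TABLE dictionary copies the collection counts and file percentages shown above, and the function name is hypothetical. Because the per-file counts for each department are not given here, the file share is a sum of the rounded percentages.

```python
# department: (number of collections, % of files), copied from the table above
TABLE = {
    "ARV": (11, 24), "BPA": (1, 1), "CGA": (1, 0),
    "DI": (22, 18), "HIL": (4, 0), "KB": (48, 47),
    "MUS": (3, 3), "RAR": (52, 4), "TRI": (68, 3),
}


def group_share(names):
    """Return (collections, % of all collections, summed % of files)."""
    total = sum(count for count, _ in TABLE.values())
    collections = sum(TABLE[name][0] for name in names)
    files_pct = sum(TABLE[name][1] for name in names)
    return collections, round(100 * collections / total, 1), files_pct


print(group_share(["ARV", "DI", "KB"]))  # (81, 38.6, 89)
print(group_share(["RAR", "TRI"]))       # (120, 57.1, 7)
```

Working from the raw counts gives 38.6% of collections for ARV, DI and KB, slightly above the 38% obtained by adding the rounded per-department percentages.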

Formats

To determine what file types are in the Dark Archive, a report was run (11/05/2015) to identify the quantities of common file types we may expect to migrate:

  • A/V: avi, m4v, mov, mp3, mp4, mpeg, mpeg2 and wav
  • Database: accdb and mdb
  • Document: doc, docx, pdf, rtf, txt and wpd
  • Image: bmp, jp2, jpeg, jpg, png, tif and tiff
  • Presentation: ppt and pptx
  • Spreadsheet: xls and xlsx
  • Web: htm and html
  • xml
  • zip
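To make the grouping above concrete, the sketch below tallies a listing of file paths into these categories by extension. The names COMMON_TYPES, EXT_TO_CATEGORY and tally are hypothetical, as are the sample paths. Whether the original report matched extensions case-sensitively is not stated, so the fold_case flag is an assumption that lets both behaviours be compared.

```python
from collections import Counter
from pathlib import PurePosixPath

# Categories and extensions copied from the list above.
COMMON_TYPES = {
    "A/V": ["avi", "m4v", "mov", "mp3", "mp4", "mpeg", "mpeg2", "wav"],
    "Database": ["accdb", "mdb"],
    "Document": ["doc", "docx", "pdf", "rtf", "txt", "wpd"],
    "Image": ["bmp", "jp2", "jpeg", "jpg", "png", "tif", "tiff"],
    "Presentation": ["ppt", "pptx"],
    "Spreadsheet": ["xls", "xlsx"],
    "Web": ["htm", "html"],
    "xml": ["xml"],
    "zip": ["zip"],
}
EXT_TO_CATEGORY = {
    ext: category for category, exts in COMMON_TYPES.items() for ext in exts
}


def tally(paths, fold_case=False):
    """Count files per category; unlisted extensions are counted as 'other'."""
    counts = Counter()
    for raw in paths:
        ext = PurePosixPath(raw).suffix.lstrip(".")
        if fold_case:
            ext = ext.lower()
        counts[EXT_TO_CATEGORY.get(ext, "other")] += 1
    return counts


# Hypothetical sample listing
sample = ["a/scan001.tif", "a/scan002.TIF", "b/minutes.pdf", "c/README"]
print(tally(sample))
print(tally(sample, fold_case=True))
```

With fold_case left off, scan002.TIF lands in "other"; with it on, both scans count as Image. The choice changes how many files fall outside the common types.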

Of the 1,963,990 files found in the "In Scope" folders in the Dark Archive, 1,842,122 (or 94%) are the common file types identified above. A subsequent query (12/16/2015) was run to identify many of the file types in the remaining 6%:

  • A/V: MOD, MPG, mpgindex, prev, swf, vprj, wma, wmv
  • Compressed: gz, tga
  • Database: MDB, sql
  • Document: DOC, PDF
  • Image: BMP, BridgeCache, BridgeCacheT, BridgeSort, gif, GIF, ico, jpe, psd
  • Presentation: thm
  • Programs: exe
  • Spreadsheet: csv
  • Web: css, js, php, url
  • XML
  • System: bak, dat, db, dll, DS_Store, info, ini, INI, log, LOG, md5, tmp
  • NEED TO BE IDENTIFIED:
    • There are at least 10 instances of each of these file types: 1, 2, 97, 98, ADD, asp, aub, auf, cct, chm, cof, cop, cos, cot, CR2, CTG, dv, ecf, EMA, eml, FAX, frf, HLP, ICM, IIQ, inc, indd, indt, inf, inx, JUN, ldif, lnk, mbx, mht, MOI, MRK, MTS, mxf, MXF, NEF, NEW, old, orig, out, pct, pl, prproj, RES, sdf, sh, sid, SIF, snm, SUM, swa, toc, u3p, VOB, xmp
    • There are an additional 559 file types that have fewer than 10 instances each:
      • 9 instances - 10 file types
      • 8 instances - 27 file types
      • 7 instances - 3 file types
      • 6 instances - 8 file types
      • 5 instances - 15 file types
      • 4 instances - 98 file types
      • 3 instances - 21 file types
      • 2 instances - 236 file types
      • 1 instance - 141 file types
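The long tail above can be summarised in two numbers: how many distinct file types it covers and how many files those types represent. The sketch below does that arithmetic from the figures in the list; the variable names are illustrative only. The final ratio compares against the 11/05/2015 totals, while the tail was queried on 12/16/2015, so it is only indicative.

```python
# instances per file type -> number of file types, copied from the list above
TAIL = {9: 10, 8: 27, 7: 3, 6: 8, 5: 15, 4: 98, 3: 21, 2: 236, 1: 141}

tail_types = sum(TAIL.values())
tail_files = sum(instances * types for instances, types in TAIL.items())

# Files outside the common types, from the 11/05/2015 totals
uncommon_files = 1_963_990 - 1_842_122

print(tail_types, tail_files)                        # 559 1518
print(round(100 * tail_files / uncommon_files, 1))   # 1.2
```

In other words, the 559 rare file types account for only about 1,500 files, a small fraction of the files that fall outside the common types.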

(NOTE: These quantities are a moving target: files are being disposed of as de-duplication continues, and Master Objects are being added as projects are completed.)

Format Counts

Of the 1,963,990 files found in the "In Scope" folders in the Dark Archive, 1,842,122 (or 94%) are the common file types identified above (as of the 11/05/2015 report).

Type         Sub-type    Quantity       Total
A/V                                     7,490
             avi              783
             m4v                0
             mov              319
             mp3              488
             mp4            4,833
             mpeg               1
             mpeg2              0
             wav            1,066
Database                                   23
             accdb              0
             mdb               23
Document                              237,234
             doc            4,041
             docx              38
             pdf          216,726
             rtf               37
             txt           16,312
             wpd               80
Image                               1,335,001
             bmp               55
             jp2           13,331     314,290
             jpeg           1,307
             jpg          299,652
             png            3,903
             tif        1,012,982   1,016,753
             tiff           3,771
Presentation                                7
             ppt                4
             pptx               3
Spreadsheet                                91
             xls               78
             xlsx              13
Web                                     6,413
             htm              592
             html           5,821
xml                       240,714     240,714
zip                        15,149      15,149

(NOTE: These quantities are a moving target: files are being disposed of as de-duplication continues, and Master Objects are being added as projects are completed.)
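The totals in the table can be cross-checked by summing the sub-type quantities. The sketch below copies the quantities from the table into a dictionary (the names COUNTS, type_totals and common_total are illustrative) and recomputes each type total, the overall count of common file types, and their share of the 1,963,990 in-scope files.

```python
IN_SCOPE_FILES = 1_963_990  # files in the "In Scope" Dark Archive folders

# Sub-type quantities copied from the Format Counts table
COUNTS = {
    "A/V": {"avi": 783, "m4v": 0, "mov": 319, "mp3": 488,
            "mp4": 4_833, "mpeg": 1, "mpeg2": 0, "wav": 1_066},
    "Database": {"accdb": 0, "mdb": 23},
    "Document": {"doc": 4_041, "docx": 38, "pdf": 216_726,
                 "rtf": 37, "txt": 16_312, "wpd": 80},
    "Image": {"bmp": 55, "jp2": 13_331, "jpeg": 1_307, "jpg": 299_652,
              "png": 3_903, "tif": 1_012_982, "tiff": 3_771},
    "Presentation": {"ppt": 4, "pptx": 3},
    "Spreadsheet": {"xls": 78, "xlsx": 13},
    "Web": {"htm": 592, "html": 5_821},
    "xml": {"xml": 240_714},
    "zip": {"zip": 15_149},
}

type_totals = {name: sum(subs.values()) for name, subs in COUNTS.items()}
common_total = sum(type_totals.values())
share = round(100 * common_total / IN_SCOPE_FILES, 1)

print(type_totals["Image"], common_total, share)  # 1335001 1842122 93.8
```

The recomputed totals match the table, and the 93.8% share rounds to the 94% quoted above.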

Format Analysis

By far the largest group of formats that will need to be migrated is TIFF images (1,016,753: tif = 1,012,982 and tiff = 3,771), which make up 55% of the 1,842,122 common-type files. It can be presumed that these are Master Objects.

The next largest group is JPEG images (314,290: jpg = 299,652; jpeg = 1,307; and jp2 = 13,331). We will need to determine if these are Provisional Masters, or whether they are derivatives (or some other category) that we would not migrate.

The third largest category is XML, with 240,714 objects. Of these, 173,037 (72%) are metadata for the President's Paper in ARV; 45,810 (19%) are in KB; 18,157 (8%) are in DI; and the remaining 3,710 (under 2%) are spread across all the other collections/departments.

The last significantly large file type is PDF, with 216,726 files, of which 172,938 (80%) are the President's Paper in ARV. We will need to determine if the remaining 20% are Provisional Masters, or whether they are derivatives (or some other category) that we would not migrate.
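The XML and PDF splits quoted above can be rechecked with a few lines of arithmetic. The helper below is a hypothetical convenience function; it takes a total and the named counts from the text and reports each share along with the unnamed remainder.

```python
def breakdown(total, parts):
    """Return {label: (count, percent)} plus an 'other' remainder entry."""
    rows = {name: (n, round(100 * n / total, 1)) for name, n in parts.items()}
    rest = total - sum(parts.values())
    rows["other"] = (rest, round(100 * rest / total, 1))
    return rows


# Figures copied from the two paragraphs above
xml = breakdown(240_714, {"ARV": 173_037, "KB": 45_810, "DI": 18_157})
pdf = breakdown(216_726, {"ARV": 172_938})

print(xml["other"])  # (3710, 1.5)
print(pdf["other"])  # (43788, 20.2)
```

On these figures, about 3,700 XML files and about 43,800 PDF files sit outside the groups named in the text and would still need a determination.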

There are 20,508 non-PDF documents (doc = 4,041; docx = 38; rtf = 37; txt = 16,312; and wpd = 80). We will need to determine whether these are Master Objects, metadata resources, project documentation or non-migration materials.

Of particular concern are the 15,149 ZIP files. Of these, 15,098 (99.7%) are in DI, which suggests that they likely came from vendors, have already been unzipped and can be discarded. The remaining 51 will need to be examined to see if they have been unzipped and whether they duplicate existing materials.
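One way to examine whether a ZIP file has already been unzipped is to compare its member names against the files present alongside it. The sketch below uses Python's standard zipfile module; the function name and the sample archive are hypothetical, and matching on names alone is an assumption that does not prove the contents are identical (a checksum comparison would be stricter).

```python
import io
import zipfile


def unextracted_members(archive, existing_paths):
    """Return archive members (files only) with no match in existing_paths."""
    existing = set(existing_paths)
    with zipfile.ZipFile(archive) as zf:
        return [
            info.filename
            for info in zf.infolist()
            if not info.is_dir() and info.filename not in existing
        ]


# Hypothetical in-memory archive standing in for a vendor delivery
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("batch01/page001.tif", b"placeholder")
    zf.writestr("batch01/page002.tif", b"placeholder")
buffer.seek(0)

print(unextracted_members(buffer, ["batch01/page001.tif"]))
# ['batch01/page002.tif']
```

An empty result would suggest the archive has been fully unzipped next to it; any names returned are candidates for closer inspection before the ZIP is discarded.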

There are 6,413 web pages (htm = 592 and html = 5,821), of which 4,056 (63%) belong to MUS. The question for all of these is whether they are derivative, exhibit-like materials or, as is the case with some of the DI material, processing notes supplied by a vendor.

There are 91 spreadsheets (xls = 78 and xlsx = 13); are these metadata, Master Objects or other materials?

The Ohio State University