MOM Formats and Collections Scope
Scope
Location | In Scope? |
Dark Archive | |
/archive/Archived | No |
/archive/Committees | No |
/archive/Dept/ARV | Yes |
/archive/Dept/ATH | Yes |
/archive/Dept/CGA | Yes |
/archive/Dept/CSS | No |
/archive/Dept/DI | Yes |
/archive/Dept/HIL | Yes |
/archive/Dept/KB | Yes, as it pertains to MO determinations in other collection folders |
/archive/Dept/MUS | Yes |
/archive/Dept/RAR | Yes |
/archive/Dept/SRI | No, this is just a mapping to /archive/Dept/KB |
/archive/Dept/TRI | Yes |
/archive/Dept/WIT | No |
/archive/Fedora | No |
/archive/FedoraMORdata | No |
/archive/lost+found | No |
J-Drive | Yes, as it pertains to MO determinations in DA folders |
External Media | Yes |
This analysis accounts for files noted as "Yes" in the above table that are in the Dark Archive. It does not currently account for those in the J-Drive or External Media.
Detailed data can be found in: fileTypeCount-20151105-rev-dwn-20151106.xlsx
Collections
For this analysis of materials in the Dark Archive, a report was run on 11/05/2015. The first-level sub-folder within a department folder is considered a "collection." While this may not always be true, it at least gives us a starting point for quantifying the scope and its impact.
Three departments (ARV, DI and KB) hold 38% of the collections but account for 89% of the files, while two departments (RAR and TRI) hold 57% of the collections but only 7% of the files:
Department | # of collections | % of collections | % of files |
ARV | 11 | 5% | 24% |
BPA | 1 | 0% | 1% |
CGA | 1 | 0% | 0% |
DI | 22 | 10% | 18% |
HIL | 4 | 2% | 0% |
KB | 48 | 23% | 47% |
MUS | 3 | 1% | 3% |
RAR | 52 | 25% | 4% |
TRI | 68 | 32% | 3% |
Totals | 210 | 100% | 100% |
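The concentration figures quoted before the table can be reproduced from the table's own percentage columns. The following is a minimal sketch; the dictionary layout and the `group_share` helper are illustrative assumptions, not part of the original report.

```python
# (number of collections, % of collections, % of files), copied from the table above.
departments = {
    "ARV": (11, 5, 24),
    "BPA": (1, 0, 1),
    "CGA": (1, 0, 0),
    "DI": (22, 10, 18),
    "HIL": (4, 2, 0),
    "KB": (48, 23, 47),
    "MUS": (3, 1, 3),
    "RAR": (52, 25, 4),
    "TRI": (68, 32, 3),
}

total_collections = sum(row[0] for row in departments.values())


def group_share(names):
    """Sum the table's rounded percentages for a group of departments.

    Returns (% of collections, % of files).
    """
    pct_collections = sum(departments[name][1] for name in names)
    pct_files = sum(departments[name][2] for name in names)
    return pct_collections, pct_files


large_file_holders = group_share(["ARV", "DI", "KB"])
many_small_collections = group_share(["RAR", "TRI"])
```

Because the sketch sums percentages that were already rounded in the table, results can differ by a point from a calculation on the raw collection counts.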
Formats
To determine what file types are in the Dark Archive, a report was run (11/05/2015) to identify the quantities of common file types we may expect to migrate:
- A/V: avi, m4v, mov, mp3, mp4, mpeg, mpeg2 and wav
- Database: accdb and mdb
- Document: doc, docx, pdf, rtf, txt and wpd
- Image: bmp, jp2, jpeg, jpg, png, tif and tiff
- Presentation: ppt and pptx
- Spreadsheet: xls and xlsx
- Web: htm and html
- xml
- zip
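A tally of this kind could be produced along the following lines. This is a minimal sketch only: the source does not say how the report was actually run, and the function names, the `other` bucket and the demonstration file names are all assumptions. Extensions are matched case-sensitively here, so `TIF` and `tif` are counted as different types.

```python
import os
import tempfile
from collections import Counter

# Category map taken from the list of common file types above.
COMMON_TYPES = {
    "A/V": {"avi", "m4v", "mov", "mp3", "mp4", "mpeg", "mpeg2", "wav"},
    "Database": {"accdb", "mdb"},
    "Document": {"doc", "docx", "pdf", "rtf", "txt", "wpd"},
    "Image": {"bmp", "jp2", "jpeg", "jpg", "png", "tif", "tiff"},
    "Presentation": {"ppt", "pptx"},
    "Spreadsheet": {"xls", "xlsx"},
    "Web": {"htm", "html"},
    "xml": {"xml"},
    "zip": {"zip"},
}


def count_extensions(root):
    """Walk root and count files per extension (case-sensitive, no leading dot).

    Files without an extension are counted under the empty string.
    """
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            extension = os.path.splitext(name)[1].lstrip(".")
            counts[extension] += 1
    return counts


def count_by_category(extension_counts):
    """Roll extension counts up into the categories above, or 'other'."""
    totals = Counter()
    for extension, quantity in extension_counts.items():
        category = next(
            (name for name, members in COMMON_TYPES.items() if extension in members),
            "other",
        )
        totals[category] += quantity
    return totals


# Small demonstration on a throwaway directory (file names are made up).
with tempfile.TemporaryDirectory() as demo_root:
    for file_name in ["a.tif", "b.tif", "c.TIF", "d.pdf", "notes"]:
        open(os.path.join(demo_root, file_name), "w").close()
    demo_counts = count_extensions(demo_root)
    demo_totals = count_by_category(demo_counts)
```

Lower-casing the extension before counting would merge variants such as `TIF` and `tif`; whether that is desirable depends on how the migration intends to treat them.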
Of the 1,963,990 files found in the "In Scope" folders in the Dark Archive, 1,842,122 (or 94%) are the common file types identified above. A subsequent query (12/16/2015) was run to identify many of the file types in the remaining 6%:
- A/V: MOD, MPG, mpgindex, prev, swf, vprj, wma, wmv
- Compressed: gz, tga
- Database: MDB, sql
- Document: DOC, PDF
- Image: BMP, BridgeCache, BridgeCacheT, BridgeSort, gif, GIF, ico, jpe, psd
- Presentation: thm
- Programs: exe
- Spreadsheet: csv
- Web: css, js, php, url
- XML
- System: bak, dat, db, dll, DS_Store, info, ini, INI, log, LOG, md5, tmp
- NEED TO BE IDENTIFIED:
- There are at least 10 instances of each of these file types: 1, 2, 97, 98, ADD, asp, aub, auf, cct, chm, cof, cop, cos, cot, CR2, CTG, dv, ecf, EMA, eml, FAX, frf, HLP, ICM, IIQ, inc, indd, indt, inf, inx, JUN, ldif, lnk, mbx, mht, MOI, MRK, MTS, mxf, MXF, NEF, NEW, old, orig, out, pct, pl, prproj, RES, sdf, sh, sid, SIF, snm, SUM, swa, toc, u3p, VOB, xmp
- There are an additional 559 file types that have fewer than 10 instances each:
- 9 instances - 10 file types
- 8 instances - 27 file types
- 7 instances - 3 file types
- 6 instances - 8 file types
- 5 instances - 15 file types
- 4 instances - 98 file types
- 3 instances - 21 file types
- 2 instances - 236 file types
- 1 instance - 141 file types
(NOTE: These quantities are a moving target: files are being disposed of as de-duplication continues, and Master Objects are being added as projects are completed.)
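The breakdown of rarely occurring file types can be checked with a little arithmetic. The sketch below uses only the figures listed above; the variable names are illustrative. It confirms the count of 559 file types and derives how many files those types represent.

```python
# instances per file type -> number of file types, from the list above.
rare_types = {9: 10, 8: 27, 7: 3, 6: 8, 5: 15, 4: 98, 3: 21, 2: 236, 1: 141}

# Number of distinct file types with fewer than 10 instances each.
rare_type_count = sum(rare_types.values())

# Number of files those types account for (instances x types, summed).
rare_file_count = sum(instances * types for instances, types in rare_types.items())

# Files in scope that are not one of the common file types.
uncommon_files = 1_963_990 - 1_842_122
```

On these figures the 559 rare types cover 1,518 files, a small part of the 121,868 files that are not common file types.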
Format Counts
Of the 1,963,990 files found in the "In Scope" folders in the Dark Archive, 1,842,122 (or 94%) are the common file types identified above (as of the 11/05/2015 report).
Type | Sub-type | Quantity | Total |
A/V | | | 7,490 |
| avi | 783 | |
| m4v | 0 | |
| mov | 319 | |
| mp3 | 488 | |
| mp4 | 4,833 | |
| mpeg | 1 | |
| mpeg2 | 0 | |
| wav | 1,066 | |
Database | | | 23 |
| accdb | 0 | |
| mdb | 23 | |
Document | | | 237,234 |
| doc | 4,041 | |
| docx | 38 | |
| pdf | 216,726 | |
| rtf | 37 | |
| txt | 16,312 | |
| wpd | 80 | |
Image | | | 1,335,001 |
| bmp | 55 | |
| jp2 | 13,331 | 314,290 |
| jpeg | 1,307 | |
| jpg | 299,652 | |
| png | 3,903 | |
| tif | 1,012,982 | 1,016,753 |
| tiff | 3,771 | |
Presentation | | | 7 |
| ppt | 4 | |
| pptx | 3 | |
Spreadsheet | | | 91 |
| xls | 78 | |
| xlsx | 13 | |
Web | | | 6,413 |
| htm | 592 | |
| html | 5,821 | |
xml | | 240,714 | 240,714 |
zip | | 15,149 | 15,149 |
(NOTE: These quantities are a moving target: files are being disposed of as de-duplication continues, and Master Objects are being added as projects are completed.)
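As a consistency check, the sub-type quantities in the table above can be summed and compared with the stated category totals and with the overall figure of 1,842,122. This is a minimal sketch using only the table's numbers; the data layout is an assumption made for illustration.

```python
# Sub-type quantities copied from the Format Counts table.
quantities = {
    "A/V": {"avi": 783, "m4v": 0, "mov": 319, "mp3": 488,
            "mp4": 4_833, "mpeg": 1, "mpeg2": 0, "wav": 1_066},
    "Database": {"accdb": 0, "mdb": 23},
    "Document": {"doc": 4_041, "docx": 38, "pdf": 216_726,
                 "rtf": 37, "txt": 16_312, "wpd": 80},
    "Image": {"bmp": 55, "jp2": 13_331, "jpeg": 1_307, "jpg": 299_652,
              "png": 3_903, "tif": 1_012_982, "tiff": 3_771},
    "Presentation": {"ppt": 4, "pptx": 3},
    "Spreadsheet": {"xls": 78, "xlsx": 13},
    "Web": {"htm": 592, "html": 5_821},
    "xml": {"xml": 240_714},
    "zip": {"zip": 15_149},
}

# Category totals as stated in the table's Total column.
stated_totals = {
    "A/V": 7_490, "Database": 23, "Document": 237_234, "Image": 1_335_001,
    "Presentation": 7, "Spreadsheet": 91, "Web": 6_413,
    "xml": 240_714, "zip": 15_149,
}

computed_totals = {name: sum(subtypes.values()) for name, subtypes in quantities.items()}
common_total = sum(computed_totals.values())

in_scope_total = 1_963_990
common_share_pct = round(100 * common_total / in_scope_total)
```

Re-running a check like this after each new report would show quickly whether de-duplication or newly added Master Objects have shifted the totals.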
Format Analysis
TIFF images are by far the largest group of formats that will need to be migrated: 1,016,753 files (tif = 1,012,982 and tiff = 3,771), or 55% of the common file types counted above. It can be presumed that these are Master Objects.
The next largest group is JPEG (314,290: jpg = 299,652; jpeg = 1,307; and jp2 = 13,331). We will need to determine if these are Provisional Masters, or whether they are derivatives (or some other category) that we would not migrate.
The third largest category is XML with 240,714 objects. 173,037 (72%) of these are metadata for the President's Paper in ARV; 45,810 (19%) are in KB; 18,157 (8%) in DI; and the remaining 1% in all the other collections/departments.
The last significantly large file type is PDF, with 216,726 files, of which 172,938 (80%) are the President's Paper in ARV. We will need to determine if the remaining 20% are Provisional Masters, or whether they are derivatives (or some other category) that we would not migrate.
There are 20,508 (doc = 4,041; docx = 38; rtf = 37; txt = 16,312; and wpd = 80) documents. We will need to determine whether these are Master Objects, metadata resources, project documentation or non-migration materials.
Of particular concern are 15,149 ZIP files. 15,098 (99.7%) are in DI, which means these are likely from vendors, have been unzipped and can be discarded. The remaining 51 will need to be examined to see if they have been unzipped and whether they duplicate existing materials.
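One way to examine whether a ZIP file has already been unzipped is to compare its member list against a candidate extraction folder. The sketch below is a hedged illustration using Python's standard `zipfile` module; the `missing_members` helper, the folder layout and the demonstration file names are assumptions, and the check compares names and sizes only, not content.

```python
import os
import tempfile
import zipfile


def missing_members(zip_path, extracted_dir):
    """List archive members with no same-sized file under extracted_dir."""
    missing = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Zip member names always use forward slashes.
            target = os.path.join(extracted_dir, *info.filename.split("/"))
            if not os.path.isfile(target) or os.path.getsize(target) != info.file_size:
                missing.append(info.filename)
    return missing


# Demonstration with a made-up vendor delivery in a throwaway directory.
with tempfile.TemporaryDirectory() as work:
    zip_path = os.path.join(work, "delivery.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("batch/page1.tif", b"abc")
        archive.writestr("batch/notes.html", b"<p>notes</p>")
    extracted = os.path.join(work, "delivery")
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(extracted)
    fully_unzipped = missing_members(zip_path, extracted)
    os.remove(os.path.join(extracted, "batch", "notes.html"))
    partly_unzipped = missing_members(zip_path, extracted)
```

An empty result suggests the archive has been fully unzipped into that folder. Comparing checksums as well as sizes would give stronger evidence before any ZIP file is discarded.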
There are 6,413 web pages (htm = 592 and html = 5,821), of which 4,056 (63%) belong to MUS. The open question for all of these is whether they are derivative, exhibit-like materials or, as is the case with some of the DI files, processing notes supplied by a vendor.
There are 91 spreadsheets (xls = 78 and xlsx = 13); are these metadata, Master Objects or other materials?
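The percentages quoted in this analysis can be recomputed from the counts given in the text. The following is a small sketch; the `pct` helper and the key names are illustrative, and every number comes from the paragraphs above.

```python
def pct(part, whole, digits=0):
    """Percentage of whole represented by part, rounded to digits places."""
    return round(100 * part / whole, digits)


common_total = 1_842_122

# Group sizes built from the per-extension counts.
tiff_total = 1_012_982 + 3_771
jpeg_total = 299_652 + 1_307 + 13_331
document_total = 4_041 + 38 + 37 + 16_312 + 80

shares = {
    "tiff_of_common": pct(tiff_total, common_total),
    "xml_in_arv": pct(173_037, 240_714),
    "xml_in_kb": pct(45_810, 240_714),
    "xml_in_di": pct(18_157, 240_714),
    "pdf_in_arv": pct(172_938, 216_726),
    "zip_in_di": pct(15_098, 15_149, digits=1),
    "web_in_mus": pct(4_056, 6_413),
}

# ZIP files outside DI that still need to be examined.
zip_to_examine = 15_149 - 15_098
```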