Internet Archive Metadata Fields

Metadata Schema

Below are _meta.xml fields that have special meaning on archive.org.

access-restricted

required: No
repeatable: No
internal use only: Yes
defined by: IA admin
edit access: IA admin
label: Access Restricted
definition: Collection contents are restricted access
accepted values: true
usage notes: This tag is only used on items of mediatype collection (it will have no effect on items of any other type). This tag should only be assigned by internal IA admins.
example: true

access-restricted-item

required: No
repeatable: No
label: Access Restricted Item
definition: Identifies item that is access-restricted
accepted values: true
usage notes: Only used on items, not collections. Automatically added to items in an access-restricted collection at the end of any task.
example: true
defined by: IA admin
edit access: IA admin
internal use only: Yes

adaptive_ocr

required: No
repeatable: No
label: Adaptive OCR
definition: Allows deriver to skip a page that would otherwise disrupt OCR
accepted values: true
defined by: uploader
edit access: uploader
internal use only: No

addeddate

required: Yes
repeatable: No
internal use only: No
defined by: IA software
edit access: not editable
label: Date Added to Public Search
definition: For dates from December 2019 onward, the time the item was added to the public search engine. For earlier dates, the date and time in UTC that the item was created on archive.org.
accepted values: YYYY-MM-DD HH:MM:SS or YYYY-MM-DD
usage notes: Beginning in December 2019, addeddate is set when the item is first added to the public search engine. It is added during the first task where the item does not have noindex present in meta.xml. The field is not changed or removed if the item is subsequently removed from public search. Prior to December 2019, addeddate was automatically set when the item directory was created in our file system. In many cases, the addeddate will be very similar to the publicdate. However, in some cases we create an item directory with metadata but no media files prior to the media being scanned. In those cases, the addeddate reflects when the item was created, regardless of when the media was added to the item. When the media is added at a later date and a derive.php task is run, the publicdate will be added to the item.
example: 2017-03-28 22:05:46
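
The two accepted formats and the December 2019 change in meaning can be handled as in the minimal sketch below. The function names are hypothetical. The sketch assumes every value can be read as UTC (the notes above state this only for the earlier dates) and assumes 2019-12-01 as the cutover day, since the notes give only the month.

```python
from datetime import datetime, timezone

# Assumption: the notes say only "December 2019", so the first of the month is used.
ADDEDDATE_CUTOVER = datetime(2019, 12, 1, tzinfo=timezone.utc)


def parse_addeddate(value):
    """Parse an addeddate string in either accepted format into a UTC datetime."""
    value = value.strip()
    for fmt in ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d"):
        try:
            # Assumption: the value is treated as UTC regardless of its date.
            return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"unrecognized addeddate: {value!r}")


def addeddate_meaning(added):
    """Say which event the addeddate records, based on the assumed cutover."""
    if added >= ADDEDDATE_CUTOVER:
        return "added to public search"
    return "item created"


if __name__ == "__main__":
    added = parse_addeddate("2017-03-28 22:05:46")
    print(added.isoformat(), "->", addeddate_meaning(added))
```

A date-only value is parsed as midnight on that day, which is a choice made for this sketch and not documented behavior.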

admin-collection

required: No
repeatable: No
internal use only: Yes
edit access: IA admin
label: Admin Collection
definition: Collection will generally be suppressed from public display, e.g. in facets, membership lists on Collection/Details pages, etc.
accepted values: true
usage notes: Only used by internal IA admins
example: true
defined by: IA admin

aspect_ratio

required: No
repeatable: No
internal use only: No
defined by: uploader
edit access: uploader
label: Aspect Ratio
definition: Ratio of the pixel width and height of a video stream
accepted values: #:#
usage notes: Standard values for this field are 4:3 and 16:9, but other values are possible.
example: 4:3
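
For anyone consuming this field programmatically, a value in #:# form can be turned into an exact ratio as in the sketch below. The function name is hypothetical, and the sketch assumes both parts are plain numbers.

```python
from fractions import Fraction


def parse_aspect_ratio(value):
    """Convert a 'width:height' string such as '4:3' into an exact Fraction."""
    width, sep, height = value.strip().partition(":")
    if not sep:
        raise ValueError(f"expected width:height, got {value!r}")
    # Fraction raises ValueError itself if either part is not a number.
    w, h = Fraction(width), Fraction(height)
    if h == 0:
        raise ValueError("height must be non-zero")
    return w / h


if __name__ == "__main__":
    for text in ("4:3", "16:9"):
        print(text, "->", float(parse_aspect_ratio(text)))
```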

audio_codec

required: No
repeatable: No
internal use only: No
defined by: IA software
edit access: IA admin
label: Audio Codec
definition: Program used to decode audio stream
accepted values: String
usage notes: Primarily used for TV Archive items.
example: ac3

audio_sample_rate

required: No
repeatable: No
internal use only: No
defined by: IA software
edit access: IA admin
label: Audio Sample Rate
definition: Samples per second
accepted values: Whole number
usage notes: Primarily used for TV Archive items.
example: 48000

betterpdf

required: No
repeatable: No
internal use only: No
defined by: uploader
edit access: uploader
label: Better PDF
definition: Indicates that the derive module should create a higher quality PDF derivative (distinguishes text from background better).
accepted values: true
usage notes: This field is either set to the value true, or is not included in meta.xml. If this field is added after the initial derive is run, the user should also run a derive task to create the better quality PDF.
example: true

bookreader-defaults

required: No
repeatable: No
internal use only: No
defined by: uploader
edit access: uploader
label: Bookreader defaults
definition: Indicates whether the bookreader should display one or two pages by default
accepted values: mode/1up mode/2up mode/thumb
usage notes: The bookreader defaults to showing books in 2up mode, so this field is generally only used to indicate that an item should be displayed in 1up mode (showing only one page at a time in the bookreader).
example: mode/1up

boxid

required: No
repeatable: Yes
definition: Location of physical item in the Physical Archive
accepted values: IA######
usage notes: Boxids always start with the letters IA followed by numbers. The numbers represent the container, pallet and box that the physical item is stored in. When there are multiple boxid fields in meta.xml, the first boxid listed represents the physical item that was digitized. Subsequent boxid fields represent the location of duplicate physical items.
example: IA158001
edit access: IA admin
defined by: IA admin
label: Box ID
internal use only: Yes
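
The boxid rules above (IA prefix followed by numbers, first boxid is the digitized copy) can be sketched as follows. The helper names are hypothetical, and the pattern deliberately accepts any number of digits because the notes do not fix a length.

```python
import re

# Assumption: "IA followed by numbers" is read as IA plus one or more digits.
BOXID_PATTERN = re.compile(r"IA\d+")


def is_valid_boxid(value):
    """Return True if the value is 'IA' followed only by digits."""
    return BOXID_PATTERN.fullmatch(value) is not None


def split_boxids(boxids):
    """Split an ordered boxid list into (digitized copy, duplicate locations).

    Per the usage notes, the first boxid is the physical item that was
    digitized; any further boxids locate duplicate physical items.
    """
    if not boxids:
        return None, []
    return boxids[0], list(boxids[1:])


if __name__ == "__main__":
    digitized, duplicates = split_boxids(["IA158001", "IA158002"])
    print(digitized, duplicates, is_valid_boxid(digitized))
```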

bwocr

required: No
repeatable: No
internal use only: No
defined by: uploader
edit access: uploader
label: Black and White OCR
definition: Allows deriver to OCR specific pages as B&W if color is causing failure.
accepted values: page number or range, e.g. 001

call_number

required: No
repeatable: No
internal use only: No
defined by: uploader
edit access: uploader
label: Call Number
definition: Contributing library’s local call number
accepted values: string
example: 6675707, NC 285.1 P9287m

camera

required: No
repeatable: No
internal use only: No
defined by: user admin
edit access: user admin
label: Camera
definition: Camera model used during digitization process
accepted values: String
example: Canon 5D

ccnum

required: No
repeatable: No
internal use only: No
edit access: uploader
label: Closed Captioning Number
definition: Indicates which closed captioning file should be used for display and search
accepted values: cc# asr ocr #
usage notes: Primarily used for TV Archive items. Closed captioning files are stored as [identifier].cc#.txt in the item. This tag indicates which cc# file to display in item and use for search indexing.
example: cc5
defined by: uploader

closed_captioning

required: No
repeatable: No
internal use only: No
defined by: uploader
edit access: uploader
label: Closed Captioning
definition: Indicates whether item contains closed captioning files
accepted values: yes no
usage notes: This field is generally only present when the video has closed captioning. When captioning is not present, the field may have “no” as the value, or may simply be omitted from meta.xml.
example: yes, no

collection

required: No
repeatable: Yes
internal use only: No
defined by: user admin
edit access: user admin
label: Collections
definition: Indicates to the website what collection(s) this item belongs to.
accepted values: Must be a valid identifier
example: prelinger
usage notes: Required for all items except “fav-username” collections.

Always list the item’s primary collection first in meta.xml; this is the collection the item “belongs” to. The primary collection often represents the entity that contributed or created the content.

Uploaders can only choose from collections that they have privileges for. General uploaders with no special privileges can only upload to selected “Community” collections or the test_collection. Items in the test_collection are removed from the site after 30 days.


Parent collections:


If the parent collections of an item’s collections are not already included in the item’s own collection list, they will be automatically added (at the end of the next task on the item).

Here, more specifically, is how that addition takes place:

All of the item’s currently listed collections are considered in turn. For each of them, we trace its ancestry all the way up to the top-level collection (usually a mediatype); in tracing ancestry, we consider only the primary (first-listed) parent collection at each step. If the original item is itself a collection, we include the top-level collection, otherwise we don’t. Any collection we encounter during this traversal of the hierarchy that isn’t already in the item’s collection list gets added to the end of the list.

For example, if the original item starts with collections A and B listed, we find A’s primary parent (call it A-P), that collection’s primary parent (A-P-P), etc., until we hit a mediatype; all of those that aren’t already listed get added, including the mediatype if the item itself is a collection. Then we do the same for B and its primary parent B-P, B-P-P, etc.
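
The traversal described above can be sketched as follows. This is an illustration of the stated rules, not the code archive.org runs. The function name and the parents mapping (collection identifier to its ordered parent list, primary first) are hypothetical, and a collection with no parents is treated as top-level.

```python
def expand_collections(item_collections, parents, item_is_collection=False):
    """Return the item's collection list with missing ancestors appended.

    parents maps a collection identifier to its ordered list of parent
    collections; only the first (primary) parent is followed at each step.
    """
    result = list(item_collections)
    seen = set(result)
    for start in item_collections:
        current = start
        visited = {start}  # guard against cycles in the supplied mapping
        while True:
            parent_list = parents.get(current, [])
            if not parent_list:
                break  # reached a top-level collection
            parent = parent_list[0]  # primary parent only
            if parent in visited:
                break
            visited.add(parent)
            is_top_level = not parents.get(parent)
            if is_top_level and not item_is_collection:
                break  # top-level collection is only added for collections
            if parent not in seen:
                result.append(parent)
                seen.add(parent)
            current = parent
    return result


if __name__ == "__main__":
    hierarchy = {
        "A": ["A-P", "other"],
        "A-P": ["A-P-P"],
        "A-P-P": ["texts"],
        "B": ["B-P"],
        "B-P": ["texts"],
        "texts": [],
    }
    print(expand_collections(["A", "B"], hierarchy))
    print(expand_collections(["A", "B"], hierarchy, item_is_collection=True))
```

With this sample hierarchy, an ordinary item ends up with A, B, A-P, A-P-P, B-P, while a collection item also gains the top-level collection texts right after A-P-P.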

color

required: No
repeatable: No
label: Color
definition: Indicates whether media is in color or black and white
accepted values: String
usage notes: The most used values are color and B&W (black and white). Mostly used for video items, to indicate whether the video is color or black and white. Can also be used to indicate different kinds of color (e.g. Kodachrome).
example: color
defined by: uploader
edit access: uploader
internal use only: No

condition

required: No
repeatable: No
label: Condition
definition: Condition of media
accepted values: Mint Near Mint Very Good Good Fair Worn Poor Fragile Incomplete
usage notes: Defines the condition of the media in an item. In 78s and LPs this indicates the condition of the disc or media file. For sets with multiple discs, use the condition of the lowest grade disc in the set.
example: Good
defined by: uploader
edit access: uploader
internal use only: No

condition-visual

required: No
repeatable: No
internal use only: No
defined by: uploader
edit access: uploader
label: Visual Condition
definition: Condition of the artwork or printed materials that accompany a media item
accepted values: Mint Near Mint Very Good Good Fair Worn Poor Fragile Incomplete None Unknown
example: Good
usage notes: Defines the condition of the artwork or printed materials that accompany the media in an item. In LPs this is used for album covers and sleeves. “Incomplete” should be used when we know that artwork/printed materials are missing. “None” should be used when the item has no artwork/printed materials and we know it is not supposed to have any. “Unknown” should be used when artwork/printed material MAY be missing, but we cannot verify.

contributor

required: No
repeatable: No
label: Contributor
definition: The person or organization that provided the physical or digital media.
accepted values: String
usage notes: For physical items that have been digitized, contributor represents the library or other organization that owns the physical item. For born-digital media, contributor often represents the organization responsible for the distribution of the content (e.g. a radio station or television station).
example: Robarts - University of Toronto
defined by: uploader
edit access: uploader
internal use only: No

coverage

required: No
repeatable: Yes
internal use only: No
defined by: uploader
edit access: uploader
label: Coverage
definition: Geographic or subject area covered by item
accepted values: String
usage notes: The preferred use of this field is to signify a geographical location that relates to the item. For example, in the TV and radio collections, we use the ISO 3166 location code for the country and state/territory of the station being recorded.
example: GB-LND

creator

required: No
repeatable: Yes
definition: The individual(s) or organization that created the media content.
usage notes: For items provided by libraries, the creator is often listed using the Library of Congress Name Authority Headings (http://authorities.loc.gov/). For items from other sources, the creator is often listed as first name and surname. When an item was created by an organization, such as a government agency or a production company, use the full name of the organization. This field represents the entity that created the media, not the person who uploaded the media to archive.org (though these may be the same person). All alphabets are supported.
example: Austen, Jane, 1775-1817, Ralph Burns
edit access: uploader
defined by: uploader
label: Creator
internal use only: No
accepted values: String

creator-alt-script

required: No
repeatable: No
internal use only: No
defined by: uploader
edit access: uploader
label: Creator Alternate Script

curation

required: No
repeatable: No
internal use only: Yes
defined by: IA admin
edit access: IA admin
label: Curation
definition: Curation state and notes
accepted values: String
usage notes: Curation is a compound field with four “sub-fields”: curator, date, state, and comment. Curator is the email address of the person who added the curation tag. Date is the UTC time and date the curation tag was added, in YYYYMMDDHHMMSS format. State can be dark, approved, freeze, un-dark, or blank. Comment can be a code used by the scanning center team to indicate issues found during QA, or a text string with some other curation comment (e.g. information about why an item was frozen or made dark). Items uploaded into open collections are generally checked by malware detection software, and the curation field will contain the results of that check.
example: [curator]lenscriv@archive.org[/curator][date]20160504125613[/date][state]approved[/state][comment]199[/comment], [curator]malware@archive.org[/curator][date]20140321085621[/date][comment]checked for malware[/comment]
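
A curation value in the bracketed form shown in the examples can be split into its sub-fields as in this minimal sketch. The function name is hypothetical, and the sketch assumes each sub-field appears at most once and is not nested.

```python
import re
from datetime import datetime, timezone

# Matches [name]text[/name] for the four documented sub-fields.
_SUBFIELD = re.compile(r"\[(curator|date|state|comment)\](.*?)\[/\1\]", re.DOTALL)


def parse_curation(value):
    """Return a dict of the sub-fields present in a curation string.

    The date sub-field (YYYYMMDDHHMMSS, UTC) is converted to a datetime;
    sub-fields that are absent from the string are absent from the dict.
    """
    fields = {name: text for name, text in _SUBFIELD.findall(value)}
    if "date" in fields:
        parsed = datetime.strptime(fields["date"], "%Y%m%d%H%M%S")
        fields["date"] = parsed.replace(tzinfo=timezone.utc)
    return fields


if __name__ == "__main__":
    sample = (
        "[curator]malware@archive.org[/curator]"
        "[date]20140321085621[/date]"
        "[comment]checked for malware[/comment]"
    )
    print(parse_curation(sample))
```

Leaving a missing state out of the result, as opposed to storing an empty string, keeps "blank" and "not recorded" distinguishable; whether that distinction matters in practice is not stated in the notes.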

date

required: No
repeatable: No
internal use only: No
defined by: uploader
edit access: uploader
label: Publication Date
definition: Date of publication
accepted values: String
usage notes: We encourage people to use YYYY, YYYY-MM, or YYYY-MM-DD for this field, but sometimes exact dates are not possible to determine. Other common usages: [YYYY] (brackets) when a date is not certain; c.a. YYYY (c.a.) when a date is approximate; and [n.d.] when a date is unknown (you may also leave the field blank in this case). If an item has a date range, such as YYYY-YYYY, we currently index only the first date in the range. Books, movies, and CDs often only have YYYY for a publication date. Magazines often have YYYY-MM for a publication date. Concerts and articles often have YYYY-MM-DD publication dates. Use the most specific verifiable date you have access to. When the item is a digital representation of a physical piece of media (e.g. a book, a 78rpm disc, etc.), the publication date should represent the date that the specific physical item was published. For example, a book may have been written in 1850 and republished in an 1885 edition. If the digitized version is the 1885 edition, use 1885 as the publication date (not 1850).
example: 1965, 2013-05-25, [n.d.]
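
The date conventions above can be told apart with a few patterns, as in the sketch below. The function names and category labels are hypothetical, and the year extraction is only an illustration of "index only the first date in the range", not the actual indexing code.

```python
import re


def classify_date(value):
    """Label a date string according to the conventions in the usage notes."""
    v = value.strip()
    if v == "" or v.lower() == "[n.d.]":
        return "unknown"
    if re.fullmatch(r"\[\d{4}\]", v):
        return "uncertain"  # [YYYY]
    if re.fullmatch(r"c\.a\. \d{4}", v):
        return "approximate"  # c.a. YYYY
    if re.fullmatch(r"\d{4}-\d{4}", v):
        return "range"  # YYYY-YYYY
    if re.fullmatch(r"\d{4}(-\d{2}(-\d{2})?)?", v):
        return "exact"  # YYYY, YYYY-MM, or YYYY-MM-DD
    return "other"


def first_year(value):
    """Return the first four-digit year in the string, or None if there is none."""
    match = re.search(r"\d{4}", value)
    return int(match.group()) if match else None


if __name__ == "__main__":
    for text in ("1965", "2013-05-25", "[n.d.]", "[1885]", "c.a. 1850", "1965-1970"):
        print(repr(text), classify_date(text), first_year(text))
```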

description

required: Recommended