Data extracts

Daily exports of the data gathered by the archive. Available on GitHub at archive.org/download/news-homepages-extracts/.

sites.csv

A roster of all the sites collected by the archiver.

URL: archive.org/download/news-homepages-extracts/sites.csv

Field

Description

handle

The unique handle of the outlet. A unique identifier

name

The name of the outlet

url

The URL of the homepage

location

The city where the site is based

timezone

The timezone where the site is based

country

The country where the site is based, recorded as a two-digit ISO 3166-1 alpha code

language

The language of the site, recorded as a two-digit ISO 639-1 alpha code

bundles.csv

Categories used to group sites.

URL: archive.org/download/news-homepages-extracts/bundles.csv

Field

Description

slug

A unique identifier

name

The name of the outlet

location

The city where the site is based

timezone

The timezone where the site is based

site-bundle-relationships.csv

The one-to-many relationship between sites and bundles. Used to join the sites and bundles files.

URL: archive.org/download/news-homepages-extracts/site-bundle-relationships.csv

Field

Description

site_handle

A unique identifier for a site

bundle_slug

A unique identifier for a bundle

items.csv

The Internet Archive items where files are stored. They all belong to the collection at archive.org/details/news-homepages.

URL: archive.org/download/news-homepages-extracts/items.csv

Field

Description

identifier

The unique identifier created by Internet Archive

handle

The unique handle of the outlet. Can be used to merge with other files

file_name

The name of the file on GitHub where the Internet Archive metadata is stored.

url

The URL on archive.org where you can find the item

title

The title of the item on Internet Archive

date

The time period the item covers

publicdate

The date the item went public

addeddate

The date the item was created

screenshot-files.csv

The image files saved in the Internet Archive.

URL: archive.org/download/news-homepages-extracts/screenshot-files.csv

Field

Description

identifier

The unique identifier created by Internet Archive

handle

The unique handle of the outlet. Can be used to merge with other files

file_name

The name of the file in the Internet Archive

url

The URL of the file

mtime

The time the file was last modified by the Internet Archive in UTC time

size

The size of the file in bytes

md5

A checksum for the file created using md5 hashing

sha1

A checksum for the file created using sh1 hashing

type

The type of screenshot. Either cropped or fullpage.

accessibility-files.csv

The accessibility information related to HTML elements in the page, captured and stored in the Internet Archive.

URL: archive.org/download/news-homepages-extracts/accessibility-files.csv

Field

Description

identifier

The unique identifier created by Internet Archive

handle

The unique handle of the outlet. Can be used to merge with other files

file_name

The name of the file in the Internet Archive

url

The URL of the file

mtime

The time the file was last modified by the Internet Archive in UTC time

size

The size of the file in bytes

md5

A checksum for the file created using md5 hashing

sha1

A checksum for the file created using sh1 hashing

lighthouse-files.csv

The summary file from a page quality report generated by Google Lighthouse, captured and stored in the Internet Archive

URL: archive.org/download/news-homepages-extracts/lighthouse-files.csv

Field

Description

identifier

The unique identifier created by Internet Archive

handle

The unique handle of the outlet. Can be used to merge with other files

file_name

The name of the file in the Internet Archive

url

The URL of the file

mtime

The time the file was last modified by the Internet Archive in UTC time

size

The size of the file in bytes

md5

A checksum for the file created using md5 hashing

sha1

A checksum for the file created using sh1 hashing

lighthouse-sample.csv

The Lighthouse metric scores recorded over the last seven days for all sites.

URL: archive.org/download/news-homepages-extracts/lighthouse-sample.csv

Field

Description

identifier

The unique identifier created by Internet Archive

handle

The unique handle of the outlet. Can be used to merge with other files

file_name

The name of the file in the Internet Archive

date

The datetime when the audit was captured

performance

Lighthouse’s performance metric score

accessbility

Lighthouse’s accessibility metric score

best_practices

Lighthouse’s best practices metric score

seo

Lighthouse’s search engine optimization metric score

pwa

Lighthouse’s progressive web application metric score

lighthouse-analysis.csv

An analysis of Lighthouse metrics drawn from the sample.

URL: archive.org/download/news-homepages-extracts/lighthouse-analysis.csv

Field

Description

handle

The unique handle of the outlet. Can be used to merge with other files

performance_count

The number of Lighthouse performance metric observations

performance_median

The median performance metric score

performance_mean

The average performance metric score

performance_min

The lowest performance metric score

performance_max

The highest performance metric score

performance_std

The standard deviation of performance metrics

accessibility_count

The number of Lighthouse accessibility metric observations

accessibility_median

The median accessibility metric score

accessibility_mean

The average accessibility metric score

accessibility_min

The lowest accessibility metric score

accessibility_max

The highest accessibility metric score

accessibility_std

The standard deviation of accessibility metrics

seo_count

The number of Lighthouse search-engine optimization metric observations

seo_median

The median search-engine optimization metric score

seo_mean

The average search-engine optimization metric score

seo_min

The lowest search-engine optimization metric score

seo_max

The highest search-engine optimization metric score

seo_std

The standard deviation of search-engine optimization metrics

seo_count

The number of Lighthouse search-engine optimization metric observations

best_practices_median

The median best practices metric score

best_practices_mean

The average best practices metric score

best_practices_min

The lowest best practices metric score

best_practices_max

The highest best practices metric score

best_practices_std

The standard deviation of best practices metrics

performance_color

The classification of the median result using Lighthouse’s three tier system. 0 to 49 is red. 50 to 89 is orange. 90 to 100 is green.

accessibility_color

The classification of the median result using Lighthouse’s three tier system. 0 to 49 is red. 50 to 89 is orange. 90 to 100 is green.

seo_color

The classification of the median result using Lighthouse’s three tier system. 0 to 49 is red. 50 to 89 is orange. 90 to 100 is green.

best_practices_color

The classification of the median result using Lighthouse’s three tier system. 0 to 49 is red. 50 to 89 is orange. 90 to 100 is green.

performance_rank

The site’s ranking when sorted by performance_median

accessibility_rank

The site’s ranking when sorted by accessibility_median

seo_rank

The site’s ranking when sorted by seo_median

best_practices_rank

The site’s ranking when sorted by best_practices_median

robotstxt-files.csv

A list of robots.txt files, captured and stored in the Internet Archive

URL: archive.org/download/news-homepages-extracts/robotstxt-files.csv

Field

Description

identifier

The unique identifier for the Internet Archive item

handle

The unique handle of the outlet. Can be used to merge with other files

file_name

The name of the file in the Internet Archive item

url

The URL of the file

mtime

The time the file was last modified by the Internet Archive in UTC time

size

The size of the file in bytes

md5

A checksum for the file created using md5 hashing

sha1

A checksum for the file created using sh1 hashing

robotstxt-sample.csv

The rules extracted from the latest robots.txt file archived for each site.

URL: archive.org/download/news-homepages-extracts/robotstxt-sample.csv

Field

Description

identifier

The unique identifier for the Internet Archive item

handle

The unique handle of the outlet. Can be used to merge with other files

date

The date when the file was captured

url

The URL to archived file on archive.org

user_agent

A user agent declared in the robots.txt file

rules

The rules declared for the user agent

wayback-files.csv

The status report from a Wayback Machine capture request made via the Internet Archive’s Save Page Now API

URL: archive.org/download/news-homepages-extracts/wayback-files.csv

Field

Description

identifier

The unique identifier created by Internet Archive

handle

The unique handle of the outlet. Can be used to merge with other files

file_name

The name of the file in the Internet Archive

url

The URL of the file

mtime

The time the file was last modified by the Internet Archive in UTC time

size

The size of the file in bytes

md5

A checksum for the file created using md5 hashing

sha1

A checksum for the file created using sh1 hashing

drudge-entities-analysis.csv

An analysis of entities extracted from Drudge Report headlines over the past 90 days

URL: archive.org/download/news-homepages-extracts/drudge-entities-analysis.csv

Field

Description

lemma

The root word

n

How many headlines the word appeared in over the last 90 days

top_verb

The most frequently used verb in headlines containing this word

timeseries

The number of headlines featuring this word on each day in our time range