Data extracts
Daily exports of the data gathered by the archive. Available from the Internet Archive at archive.org/download/news-homepages-extracts/.
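Because every extract lives under the same download path, a file's URL can be built from its name. The following is a minimal sketch using only the Python standard library. The helper names (`extract_url`, `parse_extract`, `read_extract`) are hypothetical, and the code assumes the files are served over HTTPS, encoded as UTF-8, and begin with a header row.

```python
import csv
import io
import urllib.request

# Assumption: the download path listed above is reachable over HTTPS.
BASE_URL = "https://archive.org/download/news-homepages-extracts/"


def extract_url(filename: str) -> str:
    """Build the download URL for one extract, such as "sites.csv"."""
    return BASE_URL + filename


def parse_extract(text: str) -> list[dict[str, str]]:
    """Parse CSV text into row dictionaries keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def read_extract(filename: str) -> list[dict[str, str]]:
    """Download one extract and parse it. Requires network access."""
    with urllib.request.urlopen(extract_url(filename)) as response:
        # Assumption: the extracts are UTF-8 encoded.
        text = response.read().decode("utf-8")
    return parse_extract(text)
```

Calling `read_extract("sites.csv")` would return one dictionary per site. Some extracts may be large, so streaming the response may be a better fit than reading it into memory at once.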
sites.csv
A roster of all the sites collected by the archiver.
URL: archive.org/download/news-homepages-extracts/sites.csv
| Field | Description |
|---|---|
| | The unique handle of the outlet, used as its identifier |
| | The name of the outlet |
| | The URL of the homepage |
| | The city where the site is based |
| | The timezone where the site is based |
| | The country where the site is based, recorded as a two-letter ISO 3166-1 alpha-2 code |
| | The language of the site, recorded as a two-letter ISO 639-1 code |
bundles.csv
Categories used to group sites.
URL: archive.org/download/news-homepages-extracts/bundles.csv
| Field | Description |
|---|---|
| | A unique identifier |
| | The name of the bundle |
| | The city where the site is based |
| | The timezone where the site is based |
site-bundle-relationships.csv
The one-to-many relationship between sites and bundles. Used to join the sites and bundles files.
URL: archive.org/download/news-homepages-extracts/site-bundle-relationships.csv
| Field | Description |
|---|---|
| | A unique identifier for a site |
| | A unique identifier for a bundle |
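As an illustration of how this file can link the other two, the sketch below groups site identifiers under each bundle identifier. The function name `group_sites_by_bundle`, the column names `site` and `bundle`, and the example rows are all hypothetical. Substitute the actual header names found in the CSV.

```python
from collections import defaultdict


def group_sites_by_bundle(relationships, site_key, bundle_key):
    """Map each bundle identifier to the list of site identifiers it contains."""
    grouped = defaultdict(list)
    for row in relationships:
        grouped[row[bundle_key]].append(row[site_key])
    return dict(grouped)


# Hypothetical rows and column names, for illustration only.
relationships = [
    {"site": "example-site-a", "bundle": "example-bundle-x"},
    {"site": "example-site-b", "bundle": "example-bundle-x"},
    {"site": "example-site-c", "bundle": "example-bundle-y"},
]
bundles = group_sites_by_bundle(relationships, site_key="site", bundle_key="bundle")
```

The resulting dictionary can then be used to look up the matching rows in the sites and bundles files by their identifiers.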
items.csv
The Internet Archive items where files are stored. They all belong to the collection at archive.org/details/news-homepages.
URL: archive.org/download/news-homepages-extracts/items.csv
| Field | Description |
|---|---|
| | The unique identifier created by Internet Archive |
| | The unique handle of the outlet. Can be used to merge with other files |
| | The name of the file on GitHub where the Internet Archive metadata is stored |
| | The URL on archive.org where you can find the item |
| | The title of the item on Internet Archive |
| | The time period the item covers |
| | The date the item went public |
| | The date the item was created |
screenshot-files.csv
The image files saved in the Internet Archive.
URL: archive.org/download/news-homepages-extracts/screenshot-files.csv
| Field | Description |
|---|---|
| | The unique identifier created by Internet Archive |
| | The unique handle of the outlet. Can be used to merge with other files |
| | The name of the file in the Internet Archive |
| | The URL of the file |
| | The time the file was last modified by the Internet Archive, in UTC |
| | The size of the file in bytes |
| | |
| | |
| | The type of screenshot. Either |
accessibility-files.csv
The accessibility information related to HTML elements in the page, captured and stored in the Internet Archive.
URL: archive.org/download/news-homepages-extracts/accessibility-files.csv
| Field | Description |
|---|---|
| | The unique identifier created by Internet Archive |
| | The unique handle of the outlet. Can be used to merge with other files |
| | The name of the file in the Internet Archive |
| | The URL of the file |
| | The time the file was last modified by the Internet Archive, in UTC |
| | The size of the file in bytes |
| | |
| | |
hyperlink-files.csv
A list of all hyperlinks in the page, captured and stored in the Internet Archive.
URL: archive.org/download/news-homepages-extracts/hyperlink-files.csv
| Field | Description |
|---|---|
| | The unique identifier created by Internet Archive |
| | The unique handle of the outlet. Can be used to merge with other files |
| | The name of the file in the Internet Archive |
| | The URL of the file |
| | The time the file was last modified by the Internet Archive, in UTC |
| | The size of the file in bytes |
| | |
| | |
lighthouse-files.csv
The summary file from a page quality report generated by Google Lighthouse, captured and stored in the Internet Archive.
URL: archive.org/download/news-homepages-extracts/lighthouse-files.csv
| Field | Description |
|---|---|
| | The unique identifier created by Internet Archive |
| | The unique handle of the outlet. Can be used to merge with other files |
| | The name of the file in the Internet Archive |
| | The URL of the file |
| | The time the file was last modified by the Internet Archive, in UTC |
| | The size of the file in bytes |
| | |
| | |
lighthouse-sample.csv
The Lighthouse metric scores recorded over the last seven days for all sites.
URL: archive.org/download/news-homepages-extracts/lighthouse-sample.csv
| Field | Description |
|---|---|
| | The unique identifier created by Internet Archive |
| | The unique handle of the outlet. Can be used to merge with other files |
| | The name of the file in the Internet Archive |
| | The datetime when the audit was captured |
| | Lighthouse’s performance metric score |
| | Lighthouse’s accessibility metric score |
| | Lighthouse’s best practices metric score |
| | Lighthouse’s search engine optimization metric score |
| | Lighthouse’s progressive web application metric score |
lighthouse-analysis.csv
An analysis of Lighthouse metrics drawn from the sample.
URL: archive.org/download/news-homepages-extracts/lighthouse-analysis.csv
| Field | Description |
|---|---|
| | The unique handle of the outlet. Can be used to merge with other files |
| | The number of Lighthouse performance metric observations |
| | The median performance metric score |
| | The average performance metric score |
| | The lowest performance metric score |
| | The highest performance metric score |
| | The standard deviation of performance metrics |
| | The number of Lighthouse accessibility metric observations |
| | The median accessibility metric score |
| | The average accessibility metric score |
| | The lowest accessibility metric score |
| | The highest accessibility metric score |
| | The standard deviation of accessibility metrics |
| | The number of Lighthouse search-engine optimization metric observations |
| | The median search-engine optimization metric score |
| | The average search-engine optimization metric score |
| | The lowest search-engine optimization metric score |
| | The highest search-engine optimization metric score |
| | The standard deviation of search-engine optimization metrics |
| | The number of Lighthouse best practices metric observations |
| | The median best practices metric score |
| | The average best practices metric score |
| | The lowest best practices metric score |
| | The highest best practices metric score |
| | The standard deviation of best practices metrics |
| | The classification of the median result using Lighthouse’s three tier system. 0 to 49 is red. 50 to 89 is orange. 90 to 100 is green. |
| | The classification of the median result using Lighthouse’s three tier system. 0 to 49 is red. 50 to 89 is orange. 90 to 100 is green. |
| | The classification of the median result using Lighthouse’s three tier system. 0 to 49 is red. 50 to 89 is orange. 90 to 100 is green. |
| | The classification of the median result using Lighthouse’s three tier system. 0 to 49 is red. 50 to 89 is orange. 90 to 100 is green. |
| | The site’s ranking when sorted by |
| | The site’s ranking when sorted by |
| | The site’s ranking when sorted by |
| | The site’s ranking when sorted by |
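The summary statistics and the three-tier classification described in this table can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the archive's own code: the function names are hypothetical, scores are assumed to be on a 0 to 100 scale, fractional medians between tiers (such as 49.5) are assigned to the lower tier, and the standard deviation is computed as a sample standard deviation.

```python
import statistics


def classify_score(score: float) -> str:
    """Classify a 0-100 score as red (0 to 49), orange (50 to 89) or green (90 to 100)."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score < 50:
        return "red"
    if score < 90:
        return "orange"
    return "green"


def summarize_scores(scores: list[float]) -> dict:
    """Compute the per-metric summary fields for one site's observations."""
    median = statistics.median(scores)
    return {
        "count": len(scores),
        "median": median,
        "mean": statistics.mean(scores),
        "min": min(scores),
        "max": max(scores),
        # Assumption: sample standard deviation, which needs two or more values.
        "std": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "color": classify_score(median),
    }


summary = summarize_scores([40, 60, 80])
# summary["median"] is 60 and summary["color"] is "orange"
```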
robotstxt-files.csv
A list of robots.txt files, captured and stored in the Internet Archive.
URL: archive.org/download/news-homepages-extracts/robotstxt-files.csv
| Field | Description |
|---|---|
| | The unique identifier for the Internet Archive item |
| | The unique handle of the outlet. Can be used to merge with other files |
| | The name of the file in the Internet Archive item |
| | The URL of the file |
| | The time the file was last modified by the Internet Archive, in UTC |
| | The size of the file in bytes |
| | |
| | |
robotstxt-sample.csv
The rules extracted from the latest robots.txt file archived for each site.
URL: archive.org/download/news-homepages-extracts/robotstxt-sample.csv
| Field | Description |
|---|---|
| | The unique identifier for the Internet Archive item |
| | The unique handle of the outlet. Can be used to merge with other files |
| | The date when the file was captured |
| | The URL of the archived file on archive.org |
| | A user agent declared in the robots.txt file |
| | The rules declared for the user agent |
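To show what pairing a user agent with its rules can look like, here is a minimal sketch that splits robots.txt text into one list of rules per user agent. It is not the archive's extraction code, and the format in which the extract stores rules may differ. The function name `parse_robots_rules` is hypothetical, and only `Allow`, `Disallow` and `Crawl-delay` lines are kept.

```python
GROUP_DIRECTIVES = {"allow", "disallow", "crawl-delay"}


def parse_robots_rules(text: str) -> dict[str, list[str]]:
    """Return a mapping of user agent to the rules declared for it."""
    rules: dict[str, list[str]] = {}
    current_agents: list[str] = []
    collecting_agents = False  # True while reading consecutive User-agent lines
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower() == "user-agent":
            if not collecting_agents:
                # A User-agent line after rules starts a new group.
                current_agents = []
                collecting_agents = True
            current_agents.append(value)
            rules.setdefault(value, [])
        else:
            collecting_agents = False
            if field.lower() in GROUP_DIRECTIVES:
                for agent in current_agents:
                    rules[agent].append(f"{field}: {value}")
    return rules


example = """User-agent: *
Disallow: /private/

User-agent: GPTBot
User-agent: CCBot
Disallow: /
Sitemap: https://example.com/sitemap.xml
"""
parsed = parse_robots_rules(example)
```

In this example both `GPTBot` and `CCBot` receive the `Disallow: /` rule because consecutive `User-agent` lines share one group, and the `Sitemap` line is ignored because it does not belong to any user agent.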
wayback-files.csv
The status report from a Wayback Machine capture request made via the Internet Archive’s Save Page Now API.
URL: archive.org/download/news-homepages-extracts/wayback-files.csv
| Field | Description |
|---|---|
| | The unique identifier created by Internet Archive |
| | The unique handle of the outlet. Can be used to merge with other files |
| | The name of the file in the Internet Archive |
| | The URL of the file |
| | The time the file was last modified by the Internet Archive, in UTC |
| | The size of the file in bytes |
| | |
| | |
drudge-hyperlinks-sample.csv
Hyperlinks gathered from the Drudge Report over the past 90 days.
URL: archive.org/download/news-homepages-extracts/drudge-hyperlinks-sample.csv
| Field | Description |
|---|---|
| | The unique identifier created by Internet Archive |
| | The unique handle of the outlet. Can be used to merge with other files |
| | The name of the file in the Internet Archive |
| | The datetime when the hyperlink was captured |
| | The text of the hyperlink |
| | The URL source attribute of the hyperlink |
drudge-hyperlinks-analysis.csv
An analysis of hyperlinks gathered from the Drudge Report over the past 90 days.
URL: archive.org/download/news-homepages-extracts/drudge-hyperlinks-analysis.csv
| Field | Description |
|---|---|
| | The text of the hyperlink |
| | The URL source attribute of the hyperlink |
| | The earliest datetime when the hyperlink was captured |
| | Whether or not our machine-learning system estimates that the URL leads to a news story |
| | The web domain where the URL is hosted |
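Two of these fields, the earliest capture time and the web domain, are straightforward to derive from the sample data. The sketch below shows one way to do it with the standard library. The function names and example URLs are hypothetical, and whether the published domain field strips prefixes such as `www.` is not stated here, so this version keeps the host as written.

```python
from datetime import datetime
from urllib.parse import urlparse


def url_domain(url: str) -> str:
    """Return the lowercased host portion of a URL."""
    return urlparse(url).netloc.lower()


def earliest_capture(observations):
    """Map each URL to the earliest datetime at which it was captured."""
    earliest = {}
    for url, captured_at in observations:
        if url not in earliest or captured_at < earliest[url]:
            earliest[url] = captured_at
    return earliest


# Hypothetical observations, for illustration only.
observations = [
    ("https://www.example.com/story", datetime(2024, 1, 2, 12, 0)),
    ("https://www.example.com/story", datetime(2024, 1, 1, 8, 30)),
    ("https://news.example.org/item", datetime(2024, 1, 3, 9, 0)),
]
first_seen = earliest_capture(observations)
domains = {url: url_domain(url) for url in first_seen}
```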
drudge-entities-analysis.csv
An analysis of entities extracted from Drudge Report headlines over the past 90 days.
URL: archive.org/download/news-homepages-extracts/drudge-entities-analysis.csv
| Field | Description |
|---|---|
| | The root word |
| | How many headlines the word appeared in over the last 90 days |
| | The most frequently used verb in headlines containing this word |
| | The number of headlines featuring this word on each day in our time range |