G2LM|LIC - Browsers don't lie
Data Set Description
The project collaborated with PY Insights, an internet-browser analytics platform, and Dynata, a global first-party data platform, to field the survey between mid-May and early June 2020. Individuals drawn from Dynata’s marketing pools in India, Kenya and Nigeria were invited to participate in an online survey that ended with a consensual browser data upload using the PY Insights software. Participants with valid data were compensated for their effort. PY Insights’ internet browser extension collects retrospective data stored in each user’s browser account history. This is identical to what a participant would observe if they visited the History section of their internet browser on their personal computer. The records cover up to 90 days of past activity on the browser account, accumulated across all electronic devices (computer, smartphone, tablet). The researchers observe every website visit, including the URL (uniform resource locator, i.e., web address) and timestamp.
Although browser data can include records from multiple types of electronic devices, most smartphone browser apps do not support internet browser extensions or add-ons, so the PY Insights technology only collects data from personal computers. No information is collected from private browsing or Incognito mode, and personal identifiers are removed prior to analysis.
Each URL has an associated title, which conveys meaningful information, such as a Google search phrase, the headline of a newspaper article, or a YouTube video title. Using the URL, title, and timestamp for each website visit, PY Insights calculates its duration in seconds and provides a detailed categorization scheme for each website domain. The categories are based on Google Cloud Platform’s natural language processing algorithm. The project used these categories to identify websites as being primarily related to leisure (entertainment) or production (non-recreational). Leisure includes Adults, Arts & Entertainment, Games, Online Communities (including social media), and Shopping. Production includes Business & Industrial, Computers & Electronics, Finance, Internet & Telecom (including e-mail and search engines), Jobs & Education, Law & Government, News, Science, and Reference. The remaining Google Cloud categories combined cover 0.8% of the data and are labelled as “other”, as are some websites such as spam webpages. Median daily usage in the “other” category covers 7% of total time use. Because YouTube represents a sizeable portion of usage and is classified as leisure by PY Insights, robustness checks were also conducted in which YouTube videos were re-classified as leisure-related or production-related using Google’s YouTube API.
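The leisure/production grouping described above can be illustrated with a minimal Python sketch. The category names are taken from the description; the function names, the (category, duration) input layout, and the handling of slash-separated category paths are assumptions made for illustration and do not reflect the actual PY Insights or project code.

```python
from collections import defaultdict

# Category groups as listed in the data set description.
LEISURE = frozenset({
    "Adults", "Arts & Entertainment", "Games",
    "Online Communities", "Shopping",
})
PRODUCTION = frozenset({
    "Business & Industrial", "Computers & Electronics", "Finance",
    "Internet & Telecom", "Jobs & Education", "Law & Government",
    "News", "Science", "Reference",
})


def classify(category):
    """Map a website category to 'leisure', 'production', or 'other'.

    Assumption: a category may arrive as a slash-separated path such as
    "/Arts & Entertainment/Music"; only the top level is used.
    """
    top_level = category.strip("/").split("/")[0].strip()
    if top_level in LEISURE:
        return "leisure"
    if top_level in PRODUCTION:
        return "production"
    return "other"


def time_shares(visits):
    """Return the share of total time per group.

    `visits` is an iterable of (category, duration_seconds) pairs,
    a hypothetical layout chosen for this example.
    """
    totals = defaultdict(float)
    for category, seconds in visits:
        totals[classify(category)] += seconds
    grand_total = sum(totals.values())
    groups = ("leisure", "production", "other")
    if grand_total == 0:
        return {group: 0.0 for group in groups}
    return {group: totals[group] / grand_total for group in groups}
```

For example, `time_shares([("Games", 300), ("News", 600)])` splits the recorded time into one third leisure and two thirds production.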
The researchers collected data on individuals aged 22 to 54 located in 28 states across India, as well as in Kenya and Nigeria. Participation required at least 30 days of browser data, which prevented individuals from taking part with a new browser account or a secondary browser type that is not used regularly. One user who preferred not to state their gender was dropped. The researchers took two steps to avoid computer bots: an attention test question was included in the survey, and all users with an average of more than 3,000 URL visits per day were manually dropped.
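The sample restrictions above can be summarized as a simple eligibility check. The following Python sketch is illustrative only: the `Participant` record and its field names are hypothetical, and the bot threshold was applied manually by the researchers rather than through code like this.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    # Hypothetical record layout; field names are assumptions.
    age: int
    history_days: int            # days of browser data uploaded
    total_visits: int            # URL visits across the whole history
    passed_attention_check: bool


def avg_daily_visits(p):
    """Average number of URL visits per day of recorded history."""
    if p.history_days <= 0:
        return 0.0
    return p.total_visits / p.history_days


def is_eligible(p):
    """Apply the thresholds stated in the description.

    Ages 22 to 54, at least 30 days of browser data, a passed attention
    test, and no more than 3,000 URL visits per day on average.
    """
    return (
        22 <= p.age <= 54
        and p.history_days >= 30
        and p.passed_attention_check
        and avg_daily_visits(p) <= 3000
    )
```

A participant with 60 days of history and 12,000 visits averages 200 visits per day and passes the bot threshold, while one with 400,000 visits over 90 days does not.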
Scope of Data Set
Time Periods: May - June 2020
Researchers working with the “G2LM|LIC - Browsers don't lie” data set are obligated to acknowledge the database and its documentation within their publications, including the DOI, by using this reference.
- Miller, Amalia Rebecca (University of Virginia)
- Ramdas, Kamalini (London Business School)
- Sungu, Alp (London Business School)
- Wheeler Institute for Business and Development
Access to the data is provided for not-for-profit research, replication and teaching purposes. The data is available from the Research Data Center of IZA (IDSC).
Please contact IDSC for any access requests.
INDIA KENYA NIGERIA