Tag: Human behaviour

  • Why we are all lab rats in the digital world

    Whenever we go online, we might find ourselves part of an experiment without knowing it. Digital platforms track what users do and how they respond to features. Increasingly, these tests are having real-world consequences for their participants.

    I’ve seen this in my own research on the gig economy, studying job-listing platforms that offer paid tasks and jobs to freelancers (H. A. Rahman et al. Acad. Mgmt. J. 66, 1803–1830; 2023). Platforms experimented with using different methods for scoring people’s work, as well as changing how their skills would be listed on their profile page and how they could interact with their contractors. These changes affected people’s ratings and the amount of work they received.

    Twenty years ago, such experimentation was transparent. Gig workers could opt in or out of tests. But today, these experiments are done covertly. Gig workers waive their rights when they create an account.

    Being experimented on can be disconcerting and disempowering. Imagine that, every time you enter your office, it has been redesigned. So have the ways you are evaluated and can speak with your superiors, all without your knowledge or consent. Such continual changes affect how you do and feel about your job.

    Gig workers expressed that, after noticing frequent changes on the listing platforms that were made without their consent, they started to see themselves as laboratory rats rather than valued users. Because their messages were blocked by chatbots, they were unable to speak to the platform to complain or opt out of the changes. Frustration flared and apathy set in. Their income and well-being declined.

    This is concerning, not only because of how it affects gig workers, but also because academics are increasingly becoming involved in designing digital experiments. Social scientists follow strict Institutional Review Board (IRB) procedures that govern the ethics of experiments involving people, such as informing them and requiring consent, but these rules don’t apply to technology companies. And that’s leading to questionable practices and potentially unreliable results.

    Technology companies use their terms of service to authorize them to collect data without any obligation to inform people that they were involved, or provide any opportunity for them to withdraw. Thus, digital experimentation faces scant oversight.

    Given that technology companies reach millions of people, experiments using their data can be informative. For example, a 2022 study by academics and the career platform LinkedIn answered questions about how weak links in people’s social networks contribute to job outcomes (K. Rajkumar et al. Science 377, 1304–1310; 2022). The platform varied the algorithm it uses to suggest new contacts for more than 20 million users. Those people were unaware, despite this potentially affecting their job prospects.

    Scientists themselves can be subject to such hidden practices. For example, in September, the journal Science acknowledged that studies it had published exploring political polarization using user feeds from the social-media platform Facebook were compromised when the technology giant changed its algorithm during the study period without the scientists’ knowledge (H. Holden Thorp and V. Vinson Science 385, 1393; 2024).

    Academics must be more wary about the data that they generate through collaborations with technology companies and rethink how they conduct this research. An ethically robust framework is needed for science–industry collaboration to ensure that experimentation does not jeopardize public trust in science.

    First, scholars should engage in a thorough ethics check by auditing potential partners and making sure that they follow IRB principles. They could work with or create intermediary watchdog organizations (much as Fairwork, based in Oxford, UK, safeguards gig workers’ rights) that can audit experimentation practices and introduce transparency into data collection. Such organizations can diffuse and enforce ethical norms of experimentation, inform industry partners on how to conduct ethically sound research and hold them to account.

    Second, scholars need to evaluate the social effects of experimentation to study and mitigate any potential harm. This is not trivial, because experiments rarely consider the well-being of participants and don’t assess potential unintended consequences.

    Technology companies should establish their own internal review boards, which have the authority to assess and vet experiments. Industry needs to instil a culture of ethically robust experimentation, including understanding the potential adverse effects participants might face.

    Regulation is crucial. One good example is the European Union’s Artificial Intelligence Act, which centres consumers’ right to data privacy and protection and aims to establish “a safe and controlled space for experimentation”.

    Consumers and users should form third-party organizations, similar to the unions used by gig workers, to rate companies on whether they request consent and allow people to opt out of experimentation, and how transparent they are.

    Driving forward the frontier of science–industry experimentation requires practices, rules and regulations that ensure mutually beneficial outcomes for people, organizations and society.

    Competing Interests

    The author declares no competing interests.

  • A thaw in scientific relations could help clear the air in India and Pakistan

    A train cuts through winter smog in Amritsar, India. Credit: Narinder Nanu/AFP/Getty

    A toxic haze has descended over a land area shared by some 500 million people in the northern parts of India and Pakistan. Its sources include the industrial emissions, domestic fires and diesel and petrol exhausts that form the largest components of air pollution in many parts of the world. But during winter in South Asia, crop residue burning is estimated to be the biggest source. It is an annual event that massively worsens the region’s atmospheric concentrations of fine particulate matter (particles measuring 2.5 micrometres or less in diameter). These concentrations already exceed the safe limit advised by the World Health Organization. Air pollution is a leading cause of child death and is devastating for the communities that have to endure it. It also contains climate-altering compounds.

    In Nature this week, researchers show how the field of computational social science, along with publicly available data, could help authorities in India and Pakistan begin to address a problem that affects both nations (G. Dipoppa & S. Gulzar Nature 634, 1125–1131; 2024). The work also highlights what could be achieved if scientific links between the two countries were not frozen as a result of worsening relations between their governments. An overdue thaw could save lives and improve health in both nations.

    The annual winter burning of crop waste in South Asia has its roots in earlier science. High-yielding crop varieties born of 1960s-era green-revolution technologies, combined with mechanization, have enabled farmers in the nations’ agricultural heartlands to grow wheat and rice on the same fields in the same year. Once a rice crop has been harvested, farmers burn millions of tonnes of leftover materials, clearing the land for the wheat-planting season. The resulting haze cuts visibility to a few metres, shuts schools, impedes road transport and causes flights to be cancelled.

    Researchers are actively studying both the extent of the pollution and prevention strategies. Gemma Dipoppa at Brown University in Providence, Rhode Island, and Saad Gulzar at Princeton University in New Jersey have examined the authorities’ responses to the fires in India and Pakistan over a ten-year period, from 2012 to 2022. The authors compared fire, air pollution and wind-speed data with police and court records of action taken against farmers. They also studied the effects of the pollution on health. Burning crop waste is against the law in both countries and violations can lead to farmers being fined or even imprisoned. But many are willing to take that risk. And the sheer number of farmers lighting fires at the same time makes it unfeasible for the authorities to deal with them all.

    The authors found that officials in both countries are more likely to take action against farmers if winds are blowing pollution across home turf, and that crop residue burning decreases as a result. They also found that this effect is larger in areas close to the border between the two countries — in other words, farmers in both nations are more likely to be penalized for crop residue burning if the wind is blowing towards their own side. This raises questions that would benefit from further enquiry. For example, to what extent might India’s and Pakistan’s authorities be cancelling out each other’s pollution-control efforts close to the border? And on days when one country is putting resources into dealing with high levels of pollution within its own borders, is it also receiving more pollution from its neighbour?

    Further research — both analyses of remote data and field-based studies — will help researchers to understand the perspectives of farmers and the factors underlying the actions of government officials.

    Efforts to answer these and other questions would benefit from greater collaboration. However, at present, there are minimal links between researchers in India and Pakistan. Non-governmental links (sometimes called track-two diplomacy), including scientific connections, are the weakest they have been in around a decade. Scientists used to be able to meet through the eight-country South Asian Association for Regional Cooperation (SAARC), based in Kathmandu, but SAARC has not been functioning, mainly because of the continuing tensions between India and Pakistan. The agricultural scientists’ committee has not met in five years. There’s a strong case for such links to be revived.

    So much could be gained if researchers in the two nations could communicate better, work together and study each other’s situation. Dipoppa and Gulzar’s work illustrates what can be achieved with open data, and why science should not be done solely within national borders. When it comes to addressing problems with a regional or global dimension — and when people’s lives and health are at stake — policymakers must prioritize collaboration.

  • How to spend one trillion dollars: the US decarbonization conundrum

    Subsidizing people to buy electric vehicles can end up wasting money if they would have purchased one anyway. Credit: Paul Bersebach/MediaNews Group/Orange County Register via Getty

    In the past three years, the United States has legislated to spend more than US$1 trillion on decarbonization. The Inflation Reduction Act (IRA) and the Bipartisan Infrastructure Law (BIL) focus that spending on policies that will accelerate the country’s transition to a low-carbon economy, such as tax credits for renewable energy and federal subsidies for electric vehicles.

    For the most part, state and local governments will be implementing these policies. For example, states can establish programmes to oversee rebates for energy efficiency and electrification of housing and appliances from the $4.3-billion Home Owner Managing Energy Savings (HOMES) Program and the $4.5-billion High-Efficiency Electric Home Rebate Program. Building codes and land-use policies fall under the jurisdiction of thousands of local governments across the country.

    Governments will face difficult choices about which programmes to prioritize and how to design them to maximize participation by households and businesses. There are many unknowns around changing people’s behaviour, including how to increase the use of public transportation, replace fossil-fuel heating technologies with electric alternatives and encourage farmers to adopt low-carbon practices.

    Partnering with researchers to evaluate the performance of decarbonization programmes and then improve them will be essential to ensuring that these investments achieve their greatest impact.

    Researchers have much to gain, too. Tracking these interventions offers an unprecedented opportunity to learn about the efficacy, cost-effectiveness and effects of a range of decarbonization programmes. Evidence from all these projects can guide further actions to achieve the nation’s 2050 greenhouse-gas-reduction commitments under the Paris climate agreement by committing to what works best. But this will be possible only if policies are designed and implemented in a way that allows such evaluations.

    Here, we propose three principles that could improve the cost-effectiveness of IRA and BIL investments — properly incentivizing behaviour change, quantifying spillovers and evaluating trade-offs. We call on policymakers at all levels of government to use these and to work with researchers to establish an evidence base. And we propose three actions that will increase the value of the historic investment of the IRA and BIL, and guide future decarbonization efforts.

    Target incentives to promote behaviour change

    How well decarbonization will proceed depends on the capacity to change human behaviour. The IRA and BIL include a range of financial incentives designed to induce households and firms to adopt emerging technologies, change production and consumption practices, and adjust patterns of energy use to be more sustainable. For example, the IRA targets financial credits at electric vehicles, renewable-energy production and clean manufacturing. Three-fifths of these incentives are intended to sway behaviour, and how well they do so will affect the bill’s emissions-reduction potential (see ‘Uncertain policies’).

    Uncertain policies. A stacked percentage bar chart showing behavioural and non-behavioural changes, such as clean energy and electric vehicles, and their projected funding. Source: Congressional Budget Office

    The challenge for policymakers is how to target these incentives in the way that motivates behaviour change most effectively. That means focusing on those who need the incentives before they will take action, while avoiding unnecessary payments to those who would act without them. For example, it would be more effective to use subsidies to encourage someone who is undecided about buying an electric vehicle than to allocate the money to someone who has already bought one or plans to do so.

    Yet this often happens. Researchers have found that recipients of energy-efficiency subsidies would have made the same investments at a lower subsidy level (ref. 1). They have also revealed that 60–80% of Tesla electric cars purchased in California before 2018 would have been bought without a subsidy, despite most buyers having received one (ref. 2). Those subsidies were therefore expensive, ranging from $25,000 to $52,000 per car. Such ill-targeted policies can waste public funds and increase decarbonization costs.

    By identifying the groups that are most responsive to subsidies, governments can achieve carbon reductions at a lower cost. Policies might be differentiated on the basis of characteristics such as income or geographical location (refs 3, 4). For example, in 2022, California imposed income caps on its electric-vehicle subsidies. An analysis of whether this had the intended effect of increasing uptake by targeted households would be helpful for other jurisdictions designing similar subsidies.

    Similar targeting would be beneficial in other areas. For example, the US Weatherization Assistance Program provides free home energy-efficiency upgrades for low-income households. Subsidies targeted at specific types of housing stock can be more cost-effective than if they were made available universally (ref. 5).

    Isolating the impact of an intervention can be challenging without careful research designs and appropriate statistical approaches. For instance, simply comparing the level of adoption of electric cars in states with and without subsidies will overstate the role of subsidies, because states that offer them typically have high underlying demand for such vehicles. Many other factors — such as fuel and electricity prices, income and attitudes — can also influence the comparison and can be difficult to control for.

    Therefore, we recommend that agencies deploying IRA and BIL funds integrate evaluation into programme designs from the earliest planning stage. By getting the design right from the start, the resulting evaluations will be more credible and can serve as models for future programmes to emulate or avoid.

    There are many suitable research designs, with the gold standard being randomized controlled trials. Randomizing features of programme design or the timing of roll-out across groups or geographies can create clear comparison groups. When randomization is not feasible, methods such as comparing outcomes for otherwise equivalent groups just above and below an income-based threshold can isolate causal effects.

    Unfortunately, the methods currently used to report programme impacts often do not enable credible accounting of the causal impacts of targeted programmes on emissions reductions. Collaborating with academic researchers provides a cost-effective way to access the expertise needed to build scientific consensus on optimal decarbonization strategies.

    Quantify how policy actions accelerate learning

    The development and diffusion of technologies, such as aircraft, semiconductors and wind turbines, often involve learning by doing. Typically, production costs decrease as experience accumulates, whether from one’s own efforts or from others’ (refs 6, 7). This process has implications for the short-term success or failure of technology adoption and for long-term economic growth (ref. 8).

    As with other public investments in research and development, decarbonization programmes can boost production experience, accelerate learning and drive down costs, to speed up technology adoption. The extent to which this occurs depends on the magnitude and scope of learning, the effectiveness of knowledge spillovers between firms and how well a given programme design facilitates learning.

    The costs of installing solar panels will fall as workers gain and share experience. Credit: Sandy Huffaker/Bloomberg via Getty

    In general, subsidies that target technologies that have greater potential for knowledge spillovers are likely to be more cost-effective. Research indicates, for example, that such spillovers have lowered costs in wind turbine production (ref. 9) but had less impact among solar installers (ref. 10). There is less evidence for the impacts of learning on other clean-energy sectors, such as lithium-ion batteries and green hydrogen. This knowledge gap limits the development of policy instruments that might be effective catalysts for deep decarbonization in the longer run.

    We recommend that evaluations of learning be conducted as the IRA and BIL programmes are rolled out over time and across jurisdictions. These might provide evidence on how changes in manufacturing experience for lithium-ion batteries or wind turbines induced by the IRA and BIL programmes affect the unit cost of production, for example.

    Similarly, learning in the installation phase of renewable technologies is important to understand. Armed with this knowledge, policymakers can improve their understanding of the cost-effectiveness of a range of decarbonization programmes and allocate funding to achieve superior decarbonization outcomes.

    Measure trade-offs

    By executive order, the White House has defined a goal that 40% of the overall benefits from the IRA and BIL should flow to communities that are “marginalized and overburdened by pollution” (see go.nature.com/3bgvcic). These objectives stem partly from evidence that clean-energy programmes, such as weatherization assistance, residential solar incentives and electric-vehicle subsidies, have typically gone disproportionately to homeowners and high-income households who can afford the cost (ref. 11). Meanwhile, the air, water and land pollution associated with energy and transportation services and industrial activity disproportionately affects disadvantaged populations (ref. 12).

    Developing comprehensive evidence on decarbonization programmes requires evaluating various types of impact and differentiating the effects on vulnerable communities from those experienced elsewhere. For instance, tax credits for renewable electricity production by wind farms and for battery storage might generate jobs while mitigating urban air pollution. They might also place a disproportionate environmental burden on certain communities if battery-manufacturing plants emit lead or other toxic chemicals into the air. Failing to anticipate such trade-offs could lead to a distribution of subsidies and tax credits that exacerbates inequalities among communities, businesses and households.

    To understand the distributional impacts of decarbonization programmes, it is important to identify the underlying drivers that explain variation in programme participation across different groups, as well as the indirect effects of such programmes on industrial pollution. Regular, systematic evaluation of how prioritized groups respond to available decarbonization programmes could help to assess the importance of monetary and other barriers that might hinder uptake in disadvantaged communities, such as a lack of information or programme transparency, financial constraints or programme complexity. Removing such barriers could be cost-effective in increasing programme adoption.

    Three action items

    First, states and local governments need to develop a harmonized data-collection and monitoring infrastructure to guide their decarbonization programmes. Measuring and reporting outcome data from all programmes in the same way and using the same metrics is crucial to identifying the types of programme and policy that are most cost-effective. Collection of baseline data needs to start as soon as possible — measuring a programme’s impact usually requires knowing how things looked beforehand.

    Federal agencies should issue guidance or design data platforms to help harmonize data collection by states or local governments. One good example is the IRA’s Home Energy Rebates Program (ref. 13). Harmonized data should be rich enough to be able to target low-income or other populations, and should be collected with enough frequency to be able to evaluate short-term and longer-term impacts after a programme has ended.

    Second, we recommend prioritizing rigorous independent analysis by third-party researchers to ensure the credibility and transparency of evaluations of IRA and BIL programmes. As agencies develop their plans, they should seek opportunities to partner with researchers to shape methodological designs and data-collection strategies. Making anonymized microdata public allows independent parties to conduct their own research, leading to greater diversity of insights and maximizing opportunities for learning. This is consistent with the requirement for Open Data Plans under Title II of the Foundations for Evidence-Based Policymaking Act (see go.nature.com/4j3tgjp) and should become standard procedure.

    Third, jurisdictions should share the outcomes of different interventions and coordinate their policy efforts. This will help state and local governments to avoid repeating mistakes previously made elsewhere. And jurisdictions can coordinate policy efforts to encourage reliable research practices, limit overlap between similar impact studies and explore why some policies are more cost-effective than others.

    When historians look back at the energy transition of the twenty-first century, success will have been determined by an immense amount of learning from early attempts to stimulate the development and deployment of clean technologies. If effectively coordinated and evaluated, with the involvement of the scientific community, such efforts have the potential to inform decarbonization programmes around the world and achieve decarbonization at a manageable cost.

  • Local government actions can curb air pollution in India and Pakistan

    Nature, Published online: 23 October 2024; doi:10.1038/d41586-024-03314-4

    Burning crop waste causes devastating pollution in South Asia. When local administrators have appropriate incentives to control burning, incidents go down — a finding that could guide future efforts to manage air pollution.

  • Differences in misinformation sharing can lead to politically asymmetric sanctions

    Sample and basic data collection for 2020 election study

    First, we collected a list of Twitter users who tweeted or retweeted either of the election hashtags #Trump2020 and #VoteBidenHarris2020 on 6 October 2020. We also collected the most recent 3,200 tweets sent by each of those accounts. We processed tweets and extracted tweeted domains from 34,920 randomly selected users (15,714 shared #Trump2020 and 19,206 shared #VoteBidenHarris2020), and filtered down to 12,238 users who shared at least five links to domains used by the ideology estimator of ref. 57. We also excluded 426 ‘elite’ users with more than 15,000 followers who are probably unrepresentative of Twitter users more generally (because of this exclusion, suspension data were not collected for these users; however, as described in Supplementary Information section 2, our main results on the association between political orientation and low-quality news sharing are also observed among these elite users). These data were collected as part of a project that was approved by the Massachusetts Institute of Technology Committee on the Use of Humans as Experimental Subjects Protocol 91046.

    We then constructed a politically balanced set of users by randomly selecting 4,500 users each from the remaining 4,756 users who shared #Trump2020 and 7,056 users who shared #VoteBidenHarris2020. After 9 months, on 30 July 2021, we checked the status of the 9,000 users and assessed suspension. We classify an account as having been suspended if the Twitter application programming interface (API) returned error code 63 (‘User has been suspended’) when querying that user.

    To measure a user’s tendency to share misinformation, we follow most other researchers in this space (refs 11, 12, 58, 59) and use news source quality as a proxy for article accuracy, because it is not feasible to rate the accuracy of individual tweets at scale. Specifically, to quantify the quality of news shared by each user, we leveraged a previously published set of 60 news sites (20 mainstream, 20 hyper-partisan and 20 fake news; Table 1) whose trustworthiness had been rated by 8 professional fact-checkers as well as politically balanced crowds of laypeople. The crowd ratings were determined as follows. A sample of 971 participants from the USA, quota-matched to the national distribution on age, gender, ethnicity and geographic region, were recruited through Lucid (ref. 60). Each participant indicated how much they trusted each of the 60 news outlets using a 5-point Likert scale. For each outlet, we then calculated politically balanced crowd ratings by calculating the average trust among Democrats and the average trust among Republicans, and then averaging those two average ratings.
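
    The politically balanced crowd rating described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the "D"/"R" party labels and the example Likert scores are all hypothetical. It shows why averaging the two party means differs from a pooled mean when the parties are unequally represented.

```python
from statistics import mean

def balanced_crowd_rating(ratings):
    """Average of the Democrat mean and the Republican mean for one outlet.

    ratings: list of (party, trust) pairs, where party is "D" or "R"
    (hypothetical labels) and trust is a 1-5 Likert score.
    """
    dem = [trust for party, trust in ratings if party == "D"]
    rep = [trust for party, trust in ratings if party == "R"]
    # Each party contributes equally, however many raters it has.
    return (mean(dem) + mean(rep)) / 2

# Made-up ratings for a single outlet: two Democrats, three Republicans.
example = [("D", 4), ("D", 2), ("R", 3), ("R", 1), ("R", 2)]
balanced = balanced_crowd_rating(example)
pooled = mean(trust for _, trust in example)
print(balanced, pooled)
```

    In this made-up example the pooled mean is pulled towards the larger group, whereas the balanced rating weights both parties equally.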

    We also examined Reliability ratings for a set of 283 sites from Ad Fontes Media, Inc., Factual Reporting ratings for a set of 3,216 sites from Media Bias/Fact Check and Accuracy ratings for a set of 4,767 sites from a recent academic paper by Lasser et al. (ref. 33). We then used the Twitter API to retrieve the last 3,200 posts (as of 6 October 2020) for each user in our study, and collected all links to any of those sites shared (tweeted or retweeted) by each user. Following the approach used in previous work (refs 58, 59), we calculated a news quality score for each user (bounded between 0 and 1) by averaging the ratings of all sites whose links they shared, separately for each set of site ratings. Finally, we transform these ratings into low-quality news sharing scores by subtracting the news quality ratings from 1. Over 99% of users in our study had shared at least one link to a rated domain. When combining the four expert-based measures into an aggregate news quality score, we replaced missing values with the sample mean; PCA indicated that only one component should be retained (87% of variation explained), which had weights of 0.50 on Pennycook and Rand (ref. 38) fact-checker ratings, 0.51 on Ad Fontes Media Reliability ratings, 0.48 on Media Bias/Fact Check Factual Reporting ratings and 0.51 on Lasser et al. (ref. 33) Accuracy ratings. In all PCA analyses, we use parallel analysis to determine the number of retained components.
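
    As a rough sketch of the per-user scoring step, the snippet below averages the ratings of the rated domains a user shared and subtracts the result from 1. The function name, domain names and ratings are hypothetical. It also assumes that every shared link counts once in the average (the text does not say whether repeated links to one site are weighted) and that ratings are already scaled to the 0-1 range.

```python
def low_quality_sharing_score(shared_domains, site_ratings):
    """Return 1 minus the mean rating of the rated domains a user shared.

    shared_domains: one entry per shared link (assumption: each link counts once).
    site_ratings: dict mapping domain -> quality rating in [0, 1].
    Returns None when the user shared no rated domain.
    """
    rated = [site_ratings[d] for d in shared_domains if d in site_ratings]
    if not rated:
        return None  # no rated links, so no score for this rating set
    news_quality = sum(rated) / len(rated)
    return 1 - news_quality

# Hypothetical ratings and sharing history.
site_ratings = {"a.example": 0.75, "b.example": 0.25}
shared = ["a.example", "b.example", "b.example", "b.example", "unrated.example"]
score = low_quality_sharing_score(shared, site_ratings)
print(score)
```

    Unrated domains are simply ignored here, which is one plausible reading of "all sites whose links they shared" for a given rating set.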

    To measure a user’s political orientation, we first classify their partisanship on the basis of whether they shared more #Trump2020 or #VoteBidenHarris2020 hashtags. Additionally, we retrieved all accounts followed by users in our sample and used the statistical model from ref. 39 to obtain a continuous measure of users’ ideology on the basis of the ideological leaning of the accounts they followed. Similarly, we used the statistical models from ref. 40 and ref. 12 to estimate users’ ideology using the ideological leanings of the news sites that the users shared content from. We also calculated user ideology by averaging political leanings of domains they shared through tweets or retweets on the basis of the method in ref. 12. The intuition behind these approaches is that users on social media are more likely to follow accounts (and share news stories from sources) that are aligned with their own ideology than those that are politically distant. Thus, the ideology of the accounts the user follows, and the ideology of the news sources the user shares, provide insight into the user’s ideology. When combining these four measures into an aggregate political orientation score, we replaced missing values with the sample mean; PCA indicated that only one component should be retained (88% of variation explained), which had weights of 0.49 on hashtag-based partisanship, 0.49 on follower-based ideology, 0.51 on sharing-based ideology estimated through ref. 40 and 0.51 on sharing-based ideology estimated through ref. 12. We also used this aggregate measure to calculate a user’s extent of ideological extremity by taking the absolute value of the aggregate ideology measure; and we used PCA to combine measures of the standard deviation across a user’s tweets of news site ideology scores from ref. 12 and ref. 40, and standard deviation of ideology of accounts followed from ref. 39, as a measure of the ideological uniformity (versus diversity) of news shared by the user.

    Policy simulations

    In addition to the regression analyses, we also simulate politically neutral suspension policies and determine each user’s probability of suspension; and from this, determine the level of differential impact we would expect in the absence of differential treatment. The procedure is as follows. First, we identify a set of low-quality sources that could potentially lead to suspension. We do so using the politically balanced layperson trustworthiness ratings from ref. 38, as well as using the fact-checker trustworthiness ratings from that same paper. For both sets of ratings, there is a natural discontinuity at a value of 0.25 (on a normalized trust scale from 0 = Not at all to 1 = Entirely) (Extended Data Fig. 2). Thus, we consider sites with average trustworthiness ratings below 0.25 to be ‘low quality’; and for each user, we count the number of times they tweet links to any of these low-quality sites.

    We then define a suspension policy as the probability of a user getting suspended each time they share a link to a low-quality news site. We model suspension as probabilistic because many (almost certainly most) of the articles from low-quality news sites are not actually false, and sharing such articles does not constitute an offence. Thus, we consider who would get suspended under suspension policies that differ in their harshness, varying from a 0.01% chance of getting suspended for each shared link to a low-quality news site up to a 10% chance. Specifically, for each user, we calculate their probability of getting suspended as

    $$P\left({\rm{suspended}}\right)=1-{\left(1-k\right)}^{L}$$

    where L is the number of low-quality links shared, and k is the probability of suspension for each shared link (that is, the policy harshness). The only way the user would not get suspended is if on each of the L times they share a low-quality link, they are not suspended. Because they do not get suspended with probability (1 − k), the probability that they would never get suspended is (1 − k)^L. Therefore, the probability that they would get suspended at some point is 1 − (1 − k)^L.
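
    The formula is simple to evaluate directly. The sketch below is illustrative only: the function name is hypothetical, and the intermediate harshness values between the stated endpoints (0.01% and 10%) are assumptions.

```python
def suspension_probability(num_links, harshness):
    """P(suspended) = 1 - (1 - k)^L for L low-quality links and per-link risk k."""
    if num_links < 0 or not 0.0 <= harshness <= 1.0:
        raise ValueError("need L >= 0 and 0 <= k <= 1")
    return 1.0 - (1.0 - harshness) ** num_links

# Policy harshness values spanning the range in the text (0.01% to 10% per link);
# the two middle values are assumed for illustration.
policies = [0.0001, 0.001, 0.01, 0.1]
for k in policies:
    print(k, [round(suspension_probability(L, k), 4) for L in (0, 1, 10, 100)])
```

    A user who shares no low-quality links has probability 0 under every policy, and the probability rises monotonically with both L and k.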

    We then calculate the mean (and 95% confidence interval) of that probability across all Democrats versus Republicans in our sample (as determined by sharing Biden versus Trump election hashtags). The results of these analyses are shown in Fig. 3b, and Supplementary Information section 2 presents statistical analyses of estimated probability of suspension on the basis of each measure of political orientation.
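
    The group comparison can be sketched as below. The link counts are hypothetical, and the normal-approximation interval (mean ± 1.96 standard errors) is an assumption, because the text does not say how the 95% confidence intervals were computed.

```python
import math
from statistics import mean, stdev

def suspension_probability(num_links, harshness):
    """P(suspended) = 1 - (1 - k)^L."""
    return 1.0 - (1.0 - harshness) ** num_links

def mean_with_ci(values, z=1.96):
    """Return (mean, lower, upper) using a normal-approximation 95% interval."""
    m = mean(values)
    half_width = z * stdev(values) / math.sqrt(len(values))
    return m, m - half_width, m + half_width

# Hypothetical counts of low-quality links per user, grouped by hashtag partisanship.
links_by_group = {
    "Democrat": [0, 0, 1, 2, 0, 3],
    "Republican": [1, 4, 0, 7, 2, 5],
}
k = 0.01  # one illustrative policy harshness
summary = {
    group: mean_with_ci([suspension_probability(L, k) for L in counts])
    for group, counts in links_by_group.items()
}
```

    Repeating this for each value of k traces out one curve per group, which is the kind of comparison the policy simulation reports.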

    We also do a similar exercise using the likelihood of being a bot, rather than low-quality news sharing. The algorithm of ref. 43 provides an estimated probability of being a bot for each user, on the basis of the contents of their tweets. We define a suspension policy as the minimum probability of being human, k, required to avoid suspension (or, in other words, a threshold on bot likelihood above which the user gets suspended). Specifically, for a policy of harshness k, users with bot probability greater than (1 − k) are suspended. The results of these analyses are shown in Fig. 3c.
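
    Unlike the link-sharing policy, this rule is deterministic given a user's bot score. A minimal sketch, with hypothetical bot scores and function names:

```python
def suspended_under_bot_policy(bot_probability, harshness):
    """True if the bot probability exceeds 1 - k, where k is the policy harshness."""
    return bot_probability > 1.0 - harshness

def suspension_rate(bot_probabilities, harshness):
    """Share of users suspended under a policy of harshness k."""
    flags = [suspended_under_bot_policy(p, harshness) for p in bot_probabilities]
    return sum(flags) / len(flags)

# Hypothetical bot scores for five users.
bot_scores = [0.05, 0.30, 0.62, 0.91, 0.99]
rates = {k: suspension_rate(bot_scores, k) for k in (0.05, 0.25, 0.5)}
```

    Raising k lowers the bot-likelihood threshold, so harsher policies suspend a larger share of users.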

    Reanalyses of extra datasets

    Facebook sharing in 2016 by users recruited through YouGov

    Here we analyse data presented in ref. 11. A total of n = 1,191 survey respondents recruited using YouGov gave the researchers permission to collect the links they shared on Facebook for 2 months (through a Facebook app), starting in November 2016. As part of the survey, participants self-reported their ideology (using a 5-point Likert scale; not including participants who selected ‘Not sure’, yielding n = 995 respondents with usable ideology data) and their party affiliation (Democrat, Republican, Independent, Other, Not sure). As in our Twitter studies, we calculate low-quality information sharing scores for each user by using the fact-checker and politically balanced crowd ratings for the 60 news sites from ref. 38, as described above in Table 1. A total of 893 participants shared at least one rated link.

    Twitter sharing in 2018 and 2020 by users recruited through Prolific

    Here we analyse data presented in ref. 41. A total of n = 2,100 participants were recruited using the online labour market Prolific in June 2018. Twitter IDs were provided by participants at the beginning of the study. However, some participants entered obviously fake Twitter IDs—for example, the accounts of celebrities. To screen out such accounts, we followed the original paper and excluded accounts with follower counts above the 95th percentile in the dataset. We had complete data and usable Twitter IDs for 1,901 users. As part of the survey, participants self-reported the extent to which they were economically liberal versus conservative, and socially liberal versus conservative, using 5-point Likert scales. We construct an overall ideology measure by averaging over the economic and social measures. The Twitter API was used to retrieve the content of their last 3,200 tweets (capped by the Twitter API limit). Data were retrieved from Twitter on 18 August 2018, and then again on 12 April 2020 (the latter data pull excludes tweets collected during the former data pull). We calculate low-quality information sharing scores for each user by using the fact-checker and politically balanced crowd ratings for the 60 news sites from ref. 38, as described above in Table 1. A total of 594 participants shared at least one rated link in the 2018 data pull and 379 participants shared at least one rated link in the 2020 data pull; 288 participants shared at least one rated link in both data pulls.
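
    The screening and scoring steps can be sketched as follows. The participant records are hypothetical, and the nearest-rank percentile definition is an assumption; the original paper's exact percentile method is not given here.

```python
import math

def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def screen_accounts(accounts, pct=95):
    """Drop accounts whose follower count is above the given percentile."""
    cutoff = percentile_nearest_rank([a["followers"] for a in accounts], pct)
    return [a for a in accounts if a["followers"] <= cutoff]

def overall_ideology(economic, social):
    """Average the two 5-point self-report items into one ideology score."""
    return (economic + social) / 2

# Hypothetical participants: 19 ordinary accounts and one celebrity-sized account.
accounts = [{"id": i, "followers": 100 + 10 * i, "economic": 2, "social": 4}
            for i in range(19)]
accounts.append({"id": 19, "followers": 40_000_000, "economic": 3, "social": 3})
kept = screen_accounts(accounts)
ideology = {a["id"]: overall_ideology(a["economic"], a["social"]) for a in kept}
```

    In this toy sample the celebrity-sized account is the only one above the 95th percentile, so it is the only one excluded.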

    Twitter sharing in 2021 by users who followed at least three political elites

    Here we analyse data presented by Mosleh and Rand (ref. 13), in which Twitter accounts for 816 elites were identified, and then 5,000 Twitter users were randomly sampled from the set of 38,328,679 users who followed at least three of the elite accounts. Each user’s last 3,200 tweets were collected on 23 July 2021, and sharing of low-quality news domains was assessed using the fact-checker and politically balanced crowd ratings from ref. 38. A total of 3,070 users shared at least one rated link. The statistical model from ref. 39 was used to obtain a continuous measure of users’ ideology on the basis of the ideological leaning of the accounts they followed.

    Twitter sharing in 2022 by users who followed at least three political elites

    Here we analyse previously unpublished data, in which 11,805 Twitter users were sampled from a set of 296,202,962 users who followed at least one of the political elite accounts from ref. 41. We randomly sampled from users who had more than 20 lifetime tweets and followed at least three political elites for whom we had a partisanship rating. Each user’s last 3,200 tweets were collected on 25 December 2022, and sharing of low-quality news domains was assessed using the fact-checker and politically balanced crowd ratings from ref. 38. A total of 4,040 users shared at least one rated link. The statistical model from ref. 39 was used to obtain a continuous measure of users’ ideology on the basis of the ideological leaning of the accounts they followed.

    Twitter sharing in 2023 by users who followed at least one political elite, stratified on follower count

    Here we analyse previously unpublished data in which 11,886 Twitter users were randomly sampled, stratified on the basis of log10-transformed number of followers (rounded to the nearest integer) from the same set of 296,202,962 users who followed at least one political elite account. On 4 March 2023, we retrieved all tweets made by each user since 22 December 2022 using the Twitter Academic API. Sharing of low-quality news domains was assessed using the fact-checker and politically balanced crowd ratings from ref. 38. A total of 4,408 users shared at least one rated link. The statistical model from ref. 39 was used to obtain a continuous measure of users’ ideology on the basis of the ideological leaning of the accounts they followed.
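
    Stratifying on rounded log10 follower count can be sketched as below. The user pool is hypothetical, and two details are assumptions that the text does not specify: drawing an equal number of users per stratum, and placing accounts with zero followers in stratum 0.

```python
import math
import random
from collections import defaultdict

def follower_stratum(followers):
    """Stratum label: log10 of the follower count, rounded to the nearest integer."""
    # Accounts with zero followers are placed in stratum 0 (an assumption).
    return round(math.log10(followers)) if followers > 0 else 0

def stratified_sample(users, per_stratum, seed=0):
    """Draw up to `per_stratum` users at random from each follower-count stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for user in users:
        strata[follower_stratum(user["followers"])].append(user)
    sample = []
    for label in sorted(strata):
        members = strata[label]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample

# Hypothetical user pool with follower counts spread over several orders of magnitude.
pool = [{"id": i, "followers": f} for i, f in enumerate(
    [0, 3, 8, 12, 40, 95, 150, 900, 2_500, 11_000, 48_000, 1_200_000])]
sample = stratified_sample(pool, per_stratum=1)
```

    Sampling within strata keeps rare, high-follower accounts in the sample, where simple random sampling would return almost exclusively small accounts.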

    Sharing of false claims on Twitter

    Here we analyse data from Ghezae et al. (ref. 53). Unlike the previous analyses, this dataset does not use domain quality as a proxy for misinformation sharing. Instead, sets of specific false versus true headlines were used. The headline sets were assembled by collecting claims that third-party fact-checking websites such as snopes.com or politifact.org had indicated were false, and collecting veridical claims from reputable news outlets. Furthermore, the headlines were pre-tested to determine their political orientation (on the basis of survey respondents’ evaluation of how favourable the headline, if entirely accurate, would be for the Democrats versus Republicans; see ref. 56 for details of the pre-testing procedure).

    Survey participants were recruited to rate the accuracy of each URL’s headline claim. Specifically, each participant was shown ten headlines randomly sampled from the full set of headlines, and rated how likely they thought it was that the headline was true using a 9-point scale from ‘not at all likely’ to ‘very likely’. For each headline, we created politically balanced crowd ratings by averaging the accuracy ratings of participants who identified as Democrats, averaging the accuracy ratings of participants who identified as Republicans and then averaging these two average ratings. We then classify URLs as inaccurate (and thus as misinformation) on the basis of crowd ratings if the politically balanced crowd rating was below the accuracy scale midpoint.
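
    The balancing step matters when one party is over-represented among raters. The sketch below uses hypothetical ratings and assumes the 9-point scale is coded 1 to 9, giving a midpoint of 5.

```python
from statistics import mean

SCALE_MIDPOINT = 5  # midpoint of a 9-point scale coded 1-9 (coding is an assumption)

def balanced_rating(ratings):
    """Average the Democrat mean and the Republican mean for one headline.

    `ratings` is a list of (party, score) pairs; parties other than
    "D" and "R" are ignored in this sketch.
    """
    dem = [score for party, score in ratings if party == "D"]
    rep = [score for party, score in ratings if party == "R"]
    return (mean(dem) + mean(rep)) / 2

def is_misinformation(ratings, midpoint=SCALE_MIDPOINT):
    """Classify a headline as inaccurate if its balanced rating is below the midpoint."""
    return balanced_rating(ratings) < midpoint

# Hypothetical ratings: four Democrats and two Republicans rate the same headline.
headline = [("D", 2), ("D", 3), ("D", 1), ("D", 2), ("R", 8), ("R", 6)]
score = balanced_rating(headline)
pooled = mean(s for _, s in headline)
```

    Here the balanced rating (the mean of the two party means) is higher than the pooled mean, because the pooled mean is dominated by the larger group of Democrat raters.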

    Additionally, the Twitter Academic API was used to identify all Twitter users who had posted primary tweets containing each URL. These primary tweets occurred between 2016 and 2022 (2016, 1%; 2017, 2%; 2018, 4%; 2019, 5%; 2020, 34%; 2021, 27%; 2022, 27%). The ideology of each of those users was estimated using the statistical model from ref. 39 on the basis of the ideological leaning of the accounts they followed. This allows us to count the number of liberals and conservatives who shared each URL on Twitter.

    The dataset pools across three different iterations of this procedure. The first iteration used 104 headlines selected to be politically balanced, such that the Democrat-leaning headlines were as Democrat-leaning as the Republican-leaning headlines were Republican-leaning; n = 1,319 participants from Amazon Mechanical Turk were then shown a random subset of headlines that were half politically neutral and half aligned with the participant’s partisanship. The second iteration used 155 headlines (of which 30 overlapped with headlines used in the first iteration); n = 853 participants recruited using Lucid rated randomly selected headlines. The third iteration used 149 headlines (no overlap with previous iterations); n = 866 participants recruited using Lucid rated randomly selected headlines. The Amazon Mechanical Turk sample was a pure convenience sample, whereas the Lucid samples were quota-matched to the national distribution on age, gender, ethnicity and geographic region, and then true independents were excluded. For the 30 headlines that overlapped between iterations 1 and 2, the politically balanced crowd accuracy ratings from Amazon Mechanical Turk and Lucid correlated with each other at r(28) = 0.75. Therefore, we collapsed the politically balanced ratings across platforms for those 30 headlines. In total, this resulted in a final dataset with fact-checker ratings, politically balanced crowd ratings and counts of numbers of posts by liberals and conservatives on Twitter for 378 unique URLs.

    Finally, we also classified the topic of each URL. To do so, we used Claude, an artificial intelligence system designed by Anthropic that emphasizes reliability and predictability, and has text summarization as one of its primary functions. We uploaded the full set of headlines to the artificial intelligence system, and first asked it to summarize the topics discussed in the headlines. We then asked it to indicate the topic covered in each specific headline, and manually inspected the results to ensure that the classifications were sensible. Next, we examined the frequency of each topic, synthesized the results into a set of six overarching topics and then finally asked the artificial intelligence system to categorize each headline into one of these six topics. This process led to the following distribution of topics: US Politics (174 headlines), Social Issues (91 headlines), COVID-19 (48 headlines), Business/Economy (41 headlines), Foreign Affairs (28 headlines) and Crime/Justice (26 headlines). As a test of the robustness of the classification, we also asked another artificial intelligence system, GPT4, to classify the first 100 headlines into the six topics. We found that Claude and GPT4 agreed on 80% of the headlines.
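
    The robustness check is a simple percent-agreement calculation. The sketch below uses hypothetical labels for five headlines and does not reproduce how either system was queried.

```python
from collections import Counter

def percent_agreement(labels_a, labels_b):
    """Share of items on which two classifiers assign the same topic."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equally long")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Hypothetical topic labels from two classifiers for the same five headlines.
model_a = ["US Politics", "COVID-19", "Social Issues", "US Politics", "Crime/Justice"]
model_b = ["US Politics", "COVID-19", "US Politics", "US Politics", "Crime/Justice"]
agreement = percent_agreement(model_a, model_b)
topic_counts = Counter(model_a)
```

    Raw agreement does not correct for matches expected by chance, so it is best read alongside the topic frequencies.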

    Sharing intentions of false COVID-19 claims across 16 countries

    Here, we examine survey data from ref. 37. In these experiments, participants were recruited from 16 different countries using Lucid, with respondents quota-matched to the national distributions on age and gender in each country. Participants were shown ten false and ten true claims about COVID-19 (sampled from a larger set of 45 claims), presented without any source attribution. The claims were collected from fact-checking organizations in numerous countries, as well as sources such as the World Health Organization’s list of COVID-19 myths. This approach removes ideological variation in exposure to misinformation online (ref. 13), as well as any potential source cues/effects, and directly measures variation in the decision about what to share.

    As in our other analyses, we complement the professional veracity ratings with crowd ratings. Specifically, n = 8,527 participants in the Accuracy condition rated the accuracy of each of the headlines they were shown using a 6-point Likert scale. We calculate the average accuracy rating for each statement in each country, and classify statements as misinformation if that average rating is below the scale midpoint.

    Our main analyses then focus on the responses of the n = 8,597 participants from the Sharing condition, in which participants indicated their likelihood of sharing each claim using a 6-point Likert scale. To calculate each user’s level of misinformation sharing, we first discretize the sharing intentions responses such that choices of 1 (Extremely unlikely), 2 (Moderately unlikely) or 3 (Slightly unlikely) on the Likert scale are counted as not shared, whereas choices of 4 (Slightly likely), 5 (Moderately likely) or 6 (Extremely likely) are counted as shared. We then determine, for each user, the fraction of shared articles that were (1) rated as false by fact-checkers, and (2) rated as below the accuracy scale midpoint on average by respondents in the Accuracy condition.
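
    The discretization and per-participant fraction can be sketched as follows. The claim identifiers and responses are hypothetical, and returning `None` for a participant who shared nothing is an assumption, because the text does not say how that case was handled.

```python
def was_shared(response):
    """Responses 4-6 on the 6-point scale count as shared; 1-3 as not shared."""
    if response not in range(1, 7):
        raise ValueError("response must be an integer from 1 to 6")
    return response >= 4

def misinformation_share(responses, is_misinfo):
    """Fraction of a participant's shared claims that are flagged as misinformation.

    Returns None when the participant shared nothing, because the fraction
    is then undefined.
    """
    shared = [claim for claim, r in responses.items() if was_shared(r)]
    if not shared:
        return None
    return sum(is_misinfo[claim] for claim in shared) / len(shared)

# Hypothetical participant: claim id -> sharing-intention response (1-6).
responses = {"c1": 5, "c2": 2, "c3": 4, "c4": 6, "c5": 1}
fact_checker_false = {"c1": True, "c2": True, "c3": False, "c4": False, "c5": False}
fraction = misinformation_share(responses, fact_checker_false)
```

    The same function can be called a second time with the crowd-based classification in place of `fact_checker_false` to obtain measure (2).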

    We then ask how misinformation sharing varies with ideology within each country. Specifically, we construct a conservatism measure by averaging responses to two items from the World Values Survey that were included in the survey, which asked how participants would place their views on the scales of ‘Incomes should be made more equal’ versus ‘There should be greater incentives for individual effort’ and ‘Government should take more responsibility to ensure that everyone is provided for’ versus ‘People should take more responsibility to provide for themselves’ using 10-point Likert scales. Pilot data collected in the USA confirmed that responses to these two items correlated with self-reported conservatism (r(956) = 0.32 for the first item and r(956) = 0.40 for the second item).

    Reporting summary

    Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

  • How walkable is your city? Online tool shows how major centres measure up

    The push for sustainable living has given rise to the idea of a 15-minute city: one in which residents can access essential amenities on foot or by bike in 15 minutes or less. Writing in Nature Cities, Bruno et al. built an online platform to analyse how close the world’s cities come to this ideal — and how they could be redesigned to realize it (M. Bruno et al. Nature Cities 1, 633–641; 2024). The authors found that the fraction of residents who have 15-minute access to essential services in a city is closely related to the average time it takes to reach these points, suggesting that cities with poor average accessibility are also those with the highest inequality. They also devised an algorithm to assess what proportion of amenities would need to be relocated to make them accessible to everyone.

    Competing Interests

    The author declares no competing interests.

  • How conquest and carnage have decimated landscapes worldwide

    The Burning Earth: A History Sunil Amrith W. W. Norton (2024)

    In the 1620s, King Charles I of England commissioned a Dutch water engineer, Cornelius Vermuyden, to drain the flat fenlands of East Anglia, which he considered a desolate wasteland. Locals were outraged. These wetlands, writes historian Sunil Amrith in The Burning Earth, “sustained a richness of human and more-than-human life that was now in danger”. As a pamphleteer at the time declared, many thousands of cottagers lived by harvesting “reeds, fodder, thacks, turves, flaggs, hassocks, segg” and “many other fenn commodytyes”.

    Locals, dubbed the Fen Tigers, smashed the dams, dykes and sluice gates that had been installed to divert rivers. But England’s political elite were determined to see nature “bound into service”. The marshes were ultimately drained and the land repurposed for agriculture, with the benefits accruing to rich landowners. Now known as the bread-basket of Britain, this once biodiverse wetland is at perpetual risk of flooding.

    This pattern of conquest and carnage — pitting rich against poor, colonialist against indigenous, control of nature against the flourishing of the wild — has, tragically, been repeated countless times throughout history and across the globe. Amrith narrates this sorry (and sometimes inspiring) saga with flair, in his epic exploration of human innovation and destruction.

    The fenfolk of East Anglia, he notes, were not the first to lose their livelihoods and wild land to the rich — and not the last to fight back. People with power and privilege conquered the world with machinery and lethal weapons, but the poor and powerless persevere. Indigenous peoples of Brazil, Indonesia and India continue to fight corporations that encroach on their pristine rainforests, just as Fen Tigers fought for their marshlands. It is these overlooked environmental and political conflicts on which Amrith centres his narrative.

    The fenlands of eastern England have been at constant risk of flooding since they were repurposed for agriculture. Credit: Chris Howes/Wild Places Photography/Alamy

    Bloody commerce

    For 600 years, many of these conflicts have revolved around the pursuit of luxuries. When Portuguese ships reached the North Atlantic island of Madeira in 1426, the colonists set fire to most of its forests, and later enslaved Indigenous Guanches from the nearby Canary Islands to clear the ground for sugar cultivation. In the 1470s, the Portuguese reached the coast of Ghana. In Elmina, they built a fortress that thrived as a centre first for trade in gold, ivory and peppers, and later for “the bloody Atlantic commerce in enslaved human beings”.

    At every stage, European colonists spread death and environmental destruction. In sixteenth-century Peru, Spaniards kidnapped Indigenous people and forced them to mine cinnabar, a mineral source of mercury that was used to extract silver from ore. Toxic vapours from the cinnabar refineries poisoned water, mammals, fish and the shackled humans toiling at “the mine of death” at Huancavelica. Amrith quotes one report of the time: “there used to be in this mountain”, it laments, “deer with antlers, and now not even grass is found”. Today, mercury still seeps from roads and houses made with contaminated bricks.

    Rebellion and retaliation

    But everywhere that people were enslaved, significant numbers rebelled. In Palmares, Brazil, a 10,000–20,000-strong quilombo, or community of once-enslaved fugitives, formed a self-governing society. Most residents, who survived on subsistence agriculture and trade, had roots in Angola and Congo, but some were Indigenous Brazilians, Jews and Muslims. Together, they held off attacks by Dutch and Portuguese militaries for almost a century, before the quilombo was conquered in 1694.

    A fortress built in Elmina, Ghana, was used to hold enslaved people captive. Credit: Chuck Bigger/Alamy

    Conflicts over land and nature continue today. For centuries, Indigenous peoples in rainforests grew food, including fruit and nut trees, for their own needs; as they moved to new areas, the forests rebounded. By the 1980s, however, a contagion of chainsaws and burning had led to the loss of an area of Amazonian and southeast Asian rainforest equivalent to half the size of India. In Brazil, labour leader and conservationist Chico Mendes led the fight to establish forest reserves inhabited and managed by locals. In 1990, the state of Acre created the first such zone: the 500,000-hectare Chico Mendes Extractive Reserve. But Mendes himself had been shot dead in front of his house in Xapuri in 1988, allegedly by gunmen hired by local landowning ranchers.

    In a similar grievous tale in Nigeria, environmental activist Ken Saro-Wiwa founded the Movement for the Survival of the Ogoni People, rallying 300,000 in 1993 to protest against rampant oil pollution by the energy company Shell, which had left the landscape a “desolate expanse of blackened crust”. Saro-Wiwa and eight other Ogoni leaders were imprisoned and hanged by Nigeria’s military government in 1995.

    Ongoing battle

    Development isn’t entirely bad, as Amrith stresses. Rates of death from infectious diseases have fallen drastically around the world since the start of the twentieth century, thanks to sanitation, vaccines and antibiotics. The Green Revolution — a period of rapid development of high-yield, disease-resistant wheat and rice varieties — led to tremendous booms in crop production. Between 1961 and 2014, production of cereal crops increased by 280% worldwide.

    But the Green Revolution had unintended impacts. Petrochemicals furnished the pesticides and fertilizers on which high-yield seeds depended. Diesel powered the groundwater pumps that irrigated the harvests, and pesticides permeated and poisoned the soil. In India, the revolution also perpetuated inequality between farmers who had access to transport, water and money, and “those with land too measly, too stony, too unyielding to accept new seeds”. Thousands of farmers in India die by suicide every year, faced with debt to pay for seeds and fertilizers, amid heatwaves and drought caused by climate change.

    If there’s cause for hope, it comes from those who continue to fight for environmental justice, often from the margins. In 2006, in West Timor, Indonesia, 150 women surrounded a marble mine on Mount Mutis, protesting against the destruction of eucalyptus forests and waterways on which they depended. A few years later, mining there ceased.

    And since the late 1990s in Bogotá, Colombia, 44,000 square kilometres of road have been transformed for pedestrian use, and an electrified bus network has been introduced. Five hundred kilometres of protected bicycle lanes, championed by civil-society group the Green City, intersect with the bus network.

    “More and more people are challenging the self-destructive folly that captured the imagination of the powerful and privileged for two hundred years,” Amrith writes. Almost 2,000 environmental activists — one-third of them from Indigenous communities — have been murdered around the world in the past decade. Yet powerful movements, especially of young people, continue to fight for Earth’s future.

    For these brave and unwavering humans, we can be grateful.

  • How influencers and algorithms mobilize propaganda — and distort reality

    Invisible Rulers: The People Who Turn Lies into Reality Renée DiResta Public Affairs (2024)

    Scientific institutions, public-health authorities and academics routinely face criticism and angry denouncements from ideologically motivated detractors who wish to bury inconvenient scientific evidence. With the rise of the Internet and social media, misinformation researchers, especially, have become targets for online partisan attacks (see Nature 630, 548–550; 2024). And academics routinely have to ward off political interference in many countries (ref. 1).

    Renée DiResta knows this only too well. A former research manager at the Stanford Internet Observatory (SIO) in California, she has been on the receiving end of online attacks for years, owing to her academic work combating misinformation about elections and vaccine efficacy. After a barrage of unsubstantiated accusations — including those levelled in a controversial investigation by the US House of Representatives’ judiciary committee, chaired by Republican congressman Jim Jordan — DiResta found that her research group at the SIO was suddenly dismantled in June, reportedly because of a change in institutional priorities.

    In Invisible Rulers, DiResta documents her stormy personal and professional journey into what she describes as the “fantasy–industrial complex”. It’s an insightful account of how, over the past two decades, social-media influencers, algorithms and crowds have hijacked the public debate on consequential topics — from vaccination campaigns to the validity of elections. The book’s central thesis is this: a few social-media propagandists increasingly have the power to profoundly shape public opinion. And the only maxim that seems to guide their action is, as DiResta puts it: “if you make it trend, you make it true”.

    The book’s title is a reference to public-relations pioneer Edward Bernays’s 1928 work Propaganda, which describes the ‘invisible’ people who fashion public sentiment, including public-relations experts and advertising executives. Today, that power can be in anyone’s hands.

    Charismatic individuals with large online followings are the new invisible rulers. The most elite among them, DiResta writes, possess the storytelling skills of a leading marketing executive, have the audience size of a television anchor and yet create the cozy, intimate feeling of a phone call with your best friend. They can also make immense profits, she notes, by pretending to be an ordinary person who is helping their audience to “break free of the lying mainstream media”.

    Invisible Rulers is DiResta’s attempt to lay out the motivations and methods of these individuals, who, she explains, might project themselves as being anti-elite but are, in fact, a new breed of elite. They often wield incredible power without displaying any commensurate responsibility.

    Alternate realities

    DiResta’s own journey into the world of misinformation began as a concerned mother trying to work out why classroom vaccination rates were declining in California amid a measles outbreak in 2014. She documents how, after joining the vaccine debate in support of a state bill that sought to remove ‘personal belief’ as a valid ground for seeking exemption from mandatory vaccination programmes, she was deluged by online attacks from bots and trolls.

    Although most children in California are vaccinated — signalling broad public consensus that vaccines are beneficial — DiResta describes the jarring experience of stumbling upon a seemingly alternate reality online.

    There, she found a small yet vocal band of people promoting the idea that the government and pharmaceutical industry were colluding to cover up a supposed link between vaccines and autism, a decades-old argument that has been dispelled by research (refs 2, 3).

    Studies show that a growing minority of the US population now holds this sceptical view. Without intervention, anti-vaccination sentiment might dominate vaccine discourse on social media in the next decade (ref. 4). Research also affirms DiResta’s contention that those who promote anti-vaccination rhetoric are organized and overlap with groups that champion other pseudoscience topics, such as unproven forms of alternative medicine and COVID-19 misinformation.

    DiResta’s book shines a light on the why. Often, these influencers aren’t conventional celebrities, but ordinary citizens who talk about things that interest them. Such influencers typically don’t start out peddling rumours and disinformation. But some notice that, once they start talking about a certain controversial topic, they receive more engagement on social media. The more they talk about it, the more people ‘like’ and share what they have to say, leading algorithms to recommend their content even more.

    Social-media influencers speak at a rally held by US presidential candidate Donald Trump. Credit: Al Drago/Bloomberg/Getty

    The consequences of these misinformation spirals can be felt in the real world. For example, in August, violent riots engulfed the United Kingdom after the tragic stabbing of several young children. Among the triggers were false reports spread on social media — and amplified by far-right influencers — that the perpetrator was a Muslim asylum seeker who arrived in England by boat. The actual assailant was Christian, born in Cardiff and of Rwandan origin.

    Because the rumour and its context were moral, emotional and shocking, qualities that help rumours to spread (ref. 5), the story received a lot of attention on social media. The trinity of influencers, algorithms and crowds had created an alternative reality, and misinformation provided far-right groups with the excuse they needed to leverage a tragedy to unleash violence across the country.

    How a rumour is born

    False rumours can have other nasty side effects, too. DiResta relates how her team was on the receiving end of them, while working as part of the Election Integrity Partnership, co-run by Kate Starbird, a computer scientist at the University of Washington in Seattle, who has also been a target of smear campaigns (ref. 1). In March 2021, the team issued a public report documenting instances of viral false and misleading narratives that were circulating online during the 2020 US presidential election (see go.nature.com/472ney8).

    In late 2022, statements from that report were twisted by right-leaning social-media influencers, who put forward a fantastical story about how academics, social-media companies and the US Department of Homeland Security had colluded to skew the 2020 election by taking down “millions” of social-media posts — an alleged act of mass censorship (see go.nature.com/3ak4ih0).

    In reality, DiResta explains, the study’s aim was not to censor partisan statements, but to fact-check misleading statements about the electoral process in general. Only a small proportion of the posts that contained blatant election disinformation were flagged by the project to social-media companies for further action — about 0.01% of the 22 million posts in the sample. Fewer than 400 were eventually taken down for violating the platform’s terms of service.

    Nonetheless, DiResta became the subject of rumours and conspiracy theories, including that she had undisclosed ties to the US Central Intelligence Agency, on the basis that she had done an internship there 20 years before.

    Of the multiple lawsuits that have been filed against her since these online rumours surfaced, one case was dismissed in June by the US Supreme Court for having no legal standing. As I read the book, much of it resonated with my own experience. I’ve found myself facing online accusations of being part of a government conspiracy, for example, for helping the US State Department to educate citizens to spot common techniques used in disinformation campaigns. As the online attacks continued, the motivation behind my research was misrepresented and harassment campaigns were launched against me, my colleagues and even my students. It was stranger than fiction.

    However, I also wondered about the role of another class of actors in the fantasy–industrial complex: the apologists. Think of doctors with a specialty in another medical domain who question the efficacy of vaccines or philosophers who weaponize postmodern principles to question whether an identifiable category called ‘misinformation’ even exists. DiResta overlooks them, but academics who are congenial to the messages promoted by influencers can provide troubling intellectual cover for anti-scientific claims.

    Prevention better than cure

    In terms of solutions, DiResta offers a nuanced discussion on the role of free speech, content moderation and education in our fractured media landscape. One suggestion is to give power back to the people and let audiences decide how much moderation and algorithmic ranking they want in their social-media feeds. Other ideas include teaching the public about the techniques of propaganda, because those techniques can be used by anyone.

    DiResta also offers an important tip for scientists facing political threats: instead of sticking your head in the sand, pre-emptively release the facts and prebunk falsities before an alternative reality begins to take on a life of its own. This coheres with what I know about fighting misinformation: prevention is better than cure. But to fix our societal ills, people need to share the same reality. DiResta’s book offers a powerful and compelling read on how we might achieve just that.

  • How to change people’s minds about climate change: what the science says

    Nature, Published online: 06 September 2024; doi:10.1038/d41586-024-02777-9

    Telling people about the consensus among scientists can help, study finds, but experts think that personal conversations are needed, too.

  • Loss of plasticity in deep continual learning

    Specifics of continual backpropagation

    Continual backpropagation selectively reinitializes low-utility units in the network. Our utility measure, called the contribution utility, is defined for each connection (weight) and each unit. The basic intuition behind the contribution utility is that the magnitude of the product of a unit’s activation and its outgoing weight indicates how valuable that connection is to its consumers. If the contribution of a hidden unit to its consumer is small, its contribution can be overwhelmed by contributions from other hidden units. In such a case, the hidden unit is not useful to its consumer. We define the contribution utility of a hidden unit as the sum of the utilities of all its outgoing connections. The contribution utility is measured as a running average of instantaneous contributions with a decay rate, η, which is set to 0.99 in all experiments. In a feed-forward neural network, the contribution utility, \(\mathbf{u}_{l}[i]\), of the ith hidden unit in layer l at time t is updated as

    $${{\bf{u}}}_{l}[i]=\eta \times {{\bf{u}}}_{l}[i]+(1-\eta )\times | {{\bf{h}}}_{l,i,t}| \times \mathop{\sum }\limits_{k=1}^{{n}_{l+1}}| {{\bf{w}}}_{l,i,k,t}| ,$$

    (1)

    in which \(\mathbf{h}_{l,i,t}\) is the output of the ith hidden unit in layer l at time t, \(\mathbf{w}_{l,i,k,t}\) is the weight connecting the ith unit in layer l to the kth unit in layer l + 1 at time t and \(n_{l+1}\) is the number of units in layer l + 1.
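    As a minimal sketch of equation (1) for a single layer, the update can be written in NumPy as follows. The function name, the array layout (rows of `w_out` index the units of layer l) and the toy numbers are assumptions made for illustration; only the update rule and η = 0.99 come from the text.

```python
import numpy as np

def update_contribution_utility(u, h, w_out, eta=0.99):
    """One application of equation (1) to every unit of one hidden layer.

    u     : (n_l,)          running contribution utilities
    h     : (n_l,)          current outputs of the hidden units
    w_out : (n_l, n_next)   w_out[i, k] connects unit i to unit k of the next layer
    """
    # Instantaneous contribution: |h_i| times the summed magnitude of outgoing weights.
    instantaneous = np.abs(h) * np.abs(w_out).sum(axis=1)
    return eta * u + (1.0 - eta) * instantaneous

# Toy example (hypothetical values): three hidden units, two consumers.
u = np.zeros(3)
h = np.array([0.5, 0.0, 2.0])
w_out = np.array([[1.0, -1.0], [3.0, 3.0], [0.1, -0.1]])
u = update_contribution_utility(u, h, w_out)
print(u)
```

    In this toy case the second unit has large outgoing weights but zero activation, so its utility stays at zero, which is the situation the measure is meant to detect.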

    When a hidden unit is reinitialized, its outgoing weights are initialized to zero. Initializing the outgoing weights as zero ensures that the newly added hidden units do not affect the already learned function. However, initializing the outgoing weights to zero makes the new unit vulnerable to immediate reinitialization, as it has zero utility. To prevent this, new units are exempt from reinitialization for their first m updates, in which m is the maturity threshold. We call a unit mature if its age is more than m. At every step, a fraction ρ of the mature units, called the replacement rate, is reinitialized in every layer.

    The replacement rate ρ is typically set to a very small value, meaning that only one unit is replaced after hundreds of updates. For example, in class-incremental CIFAR-100 (Fig. 2) we used continual backpropagation with a replacement rate of \(10^{-5}\). The last layer of the network in that problem had 512 units. At each step, roughly 512 × \(10^{-5}\) = 0.00512 units are replaced. This corresponds to roughly one replacement after every 1/0.00512 ≈ 200 updates or one replacement after every eight epochs on the first five classes.
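    The arithmetic in this example can be checked with a few lines of Python. The sketch assumes that all 512 units are mature, and it takes the 450 training images per class and the mini-batch size of 90 from the class-incremental CIFAR-100 setup described later in this section; the variable names are illustrative.

```python
n_units = 512
replacement_rate = 1e-5

# Expected number of units replaced per update, if every unit is mature.
units_per_update = n_units * replacement_rate
updates_per_replacement = 1 / units_per_update

# First increment: 5 classes x 450 training images, mini-batches of 90.
updates_per_epoch = (5 * 450) / 90
epochs_per_replacement = updates_per_replacement / updates_per_epoch

print(units_per_update, updates_per_replacement, epochs_per_replacement)
```

    The exact quotient is a little under 200 updates, which is a little under eight epochs at 25 updates per epoch, consistent with the rounded figures in the text.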

    The final algorithm combines conventional backpropagation with selective reinitialization to continually inject random units from the initial distribution. Continual backpropagation performs a gradient descent and selective reinitialization step at each update. Algorithm 1 specifies continual backpropagation for a feed-forward neural network. In cases in which the learning system uses mini-batches, the instantaneous contribution utility can be used by averaging the utility over the mini-batch instead of keeping a running average to save computation (see Extended Data Fig. 5d for an example). Continual backpropagation overcomes the limitation of previous work34,35 on selective reinitialization and makes it compatible with modern deep learning.

    Algorithm 1: Continual backpropagation for a feed-forward network with L layers

    Set replacement rate ρ, decay rate η and maturity threshold m
    Initialize the weights \(\mathbf{w}_{0},\ldots,\mathbf{w}_{L-1}\), in which \(\mathbf{w}_{l}\) is sampled from distribution \(d_{l}\)
    Initialize utilities \(\mathbf{u}_{1},\ldots,\mathbf{u}_{L-1}\), number of units to replace \(c_{1},\ldots,c_{L-1}\), and ages \(\mathbf{a}_{1},\ldots,\mathbf{a}_{L-1}\) to 0
    For each input \(\mathbf{x}_{t}\) do
        Forward pass: pass \(\mathbf{x}_{t}\) through the network to get the prediction \(\widehat{\mathbf{y}}_{t}\)
        Evaluate: receive loss \(l(\mathbf{x}_{t},\widehat{\mathbf{y}}_{t})\)
        Backward pass: update the weights using SGD or one of its variants
        For layer l in 1: L − 1 do
            Update age: \(\mathbf{a}_{l}=\mathbf{a}_{l}+1\)
            Update unit utility: see equation (1)
            Find eligible units: \(n_{\mathrm{eligible}}\) = number of units with age greater than m
            Update number of units to replace: \(c_{l}=c_{l}+n_{\mathrm{eligible}}\times\rho\)
            If \(c_{l}>1\)
                Find the unit with smallest utility and record its index as r
                Reinitialize input weights: resample \(\mathbf{w}_{l-1}[:,r]\) from distribution \(d_{l}\)
                Reinitialize output weights: set \(\mathbf{w}_{l}[r,:]\) to 0
                Reinitialize utility and age: set \(\mathbf{u}_{l}[r]=0\) and \(\mathbf{a}_{l}[r]=0\)
                Update number of units to replace: \(c_{l}=c_{l}-1\)
            End If
        End For
    End For
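    To make the control flow of Algorithm 1 concrete, here is a small NumPy sketch for a network with one hidden ReLU layer and a squared-error loss. It is an illustration under stated assumptions, not the authors' implementation: the class name, the uniform Kaiming-style initial distribution, the plain SGD update, the restriction of the lowest-utility search to mature units and the toy sine target are all choices made here.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(n_in, n_out):
    # Assumed initial distribution d_l: uniform with a Kaiming-style bound for ReLU.
    bound = np.sqrt(2.0) * np.sqrt(3.0 / n_in)
    return rng.uniform(-bound, bound, size=(n_in, n_out))

class ContinualBackpropNet:
    """One-hidden-layer ReLU regressor: SGD plus selective reinitialization."""

    def __init__(self, n_in, n_hidden, rho=1e-3, eta=0.99, maturity=100, step_size=0.01):
        self.w0 = init_weights(n_in, n_hidden)     # input weights of the hidden units
        self.w1 = init_weights(n_hidden, 1)        # outgoing weights of the hidden units
        self.u = np.zeros(n_hidden)                # contribution utilities
        self.age = np.zeros(n_hidden, dtype=int)   # updates since (re)initialization
        self.c = 0.0                               # accumulated number of units to replace
        self.rho, self.eta, self.m, self.lr = rho, eta, maturity, step_size
        self.n_replaced = 0

    def update(self, x, y):
        # Forward pass and loss.
        h = np.maximum(0.0, x @ self.w0)
        err = (h @ self.w1)[0] - y
        # Backward pass: SGD on 0.5 * err**2 (gradients computed before any update).
        grad_w1 = err * h[:, None]
        grad_h = err * self.w1[:, 0] * (h > 0)
        grad_w0 = np.outer(x, grad_h)
        self.w1 -= self.lr * grad_w1
        self.w0 -= self.lr * grad_w0
        # Update ages and utilities (equation (1)).
        self.age += 1
        contribution = np.abs(h) * np.abs(self.w1).sum(axis=1)
        self.u = self.eta * self.u + (1 - self.eta) * contribution
        # Accumulate the fractional number of units to replace.
        eligible = np.flatnonzero(self.age > self.m)
        self.c += eligible.size * self.rho
        if self.c > 1 and eligible.size > 0:
            # Replace the mature unit with the smallest utility.
            r = eligible[np.argmin(self.u[eligible])]
            self.w0[:, r] = init_weights(self.w0.shape[0], 1)[:, 0]
            self.w1[r, :] = 0.0
            self.u[r] = 0.0
            self.age[r] = 0
            self.c -= 1
            self.n_replaced += 1
        return err ** 2

# Toy usage: a stream of random inputs with an arbitrary smooth target.
net = ContinualBackpropNet(n_in=4, n_hidden=20, rho=1e-3, maturity=50)
for t in range(2000):
    x = rng.standard_normal(4)
    net.update(x, y=np.sin(x.sum()))
print("units replaced:", net.n_replaced)
```

    The accumulator `c` is what lets a replacement rate far below one unit per step translate into occasional whole-unit replacements. A deeper network would keep one accumulator, utility vector and age vector per hidden layer.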

    Details of Continual ImageNet

    The ImageNet database we used consists of 1,000 classes, each with 700 images. The 700 images for each class were divided into 600 images for a training set and 100 images for a test set. On each binary classification task, the deep-learning network was first trained on the training set of 1,200 images and then its classification accuracy was measured on the test set of 200 images. The training consisted of several passes through the training set, called epochs. For each task, all learning algorithms performed 250 passes through the training set using mini-batches of size 100. All tasks used the downsampled 32 × 32 version of the ImageNet dataset, as is often done to save computation51.

    All algorithms on Continual ImageNet used a convolutional network. The network had three convolutional-plus-max-pooling layers, followed by three fully connected layers, as detailed in Extended Data Table 3. The final layer consisted of just two units, the heads, corresponding to the two classes. At task changes, the input weights of the heads were reset to zero. Resetting the heads in this way can be viewed as introducing new heads for the new tasks. This resetting of the output weights is not ideal for studying plasticity, as the learning system gets access to privileged information on the timing of task changes (and we do not use it in other experiments in this paper). We use it here because it is the standard practice in deep continual learning for this type of problem in which the learning system has to learn a sequence of independent tasks52.

    In this problem, we reset the head of the network at the beginning of each task. It means that, for a linear network, the whole network is reset. That is why the performance of a linear network will not degrade in Continual ImageNet. As the linear network is a baseline, having a low-variance estimate of its performance is desirable. The value of this baseline is obtained by averaging over thousands of tasks. This averaging gives us a much better estimate of its performance than other networks.

    The network was trained using SGD with momentum on the cross-entropy loss and initialized once before the first task. The momentum hyperparameter was 0.9. We tested various step-size parameters for backpropagation but only presented the performance for step sizes 0.01, 0.001 and 0.0001 for clarity of Fig. 1b. We performed 30 runs for each hyperparameter value, varying the sequence of tasks and other randomness. Across different hyperparameters and algorithms, the same sequences of pairs of classes were used.

    We now describe the hyperparameter selection for L2 regularization, Shrink and Perturb and continual backpropagation. The main text presents the results for these algorithms on Continual ImageNet in Fig. 1c. We performed a grid search for all algorithms to find the set of hyperparameters that had the highest average classification accuracy over 5,000 tasks. The values of hyperparameters used for the grid search are described in Extended Data Table 2. L2 regularization has two hyperparameters, step size and weight decay. Shrink and Perturb has three hyperparameters, step size, weight decay and noise variance. We swept over two hyperparameters of continual backpropagation: step size and replacement rate. The maturity threshold in continual backpropagation was set to 100. For both backpropagation and L2 regularization, the performance was poor for step sizes of 0.1 or 0.003. We chose to only use step sizes of 0.03 and 0.01 for continual backpropagation and Shrink and Perturb. We performed ten independent runs for all sets of hyperparameters. Then we performed another 20 runs to complete 30 runs for the best-performing set of hyperparameters to produce the results in Fig. 1c.

    Class-incremental CIFAR-100

    In the class-incremental CIFAR-100, the learning system gets access to more and more classes over time. Classes are provided to the learning system in increments of five. First, it has access to just five classes, then ten and so on, until it gets access to all 100 classes. The learning system is evaluated on the basis of how well it can discriminate between all the available classes at present. The dataset consists of 100 classes with 600 images each. The 600 images for each class were divided into 450 images to create a training set, 50 for a validation set and 100 for a test set. Note that the network is trained on all data from all classes available at present. First, it is trained on data from just five classes, then from all ten classes and so on, until finally, it is trained from data from all 100 classes simultaneously.

    After each increment, the network was trained for 200 epochs, for a total of 4,000 epochs for all 20 increments. We used a learning-rate schedule that resets at the start of each increment. For the first 60 epochs of each increment, the learning rate was set to 0.1, then to 0.02 for the next 60 epochs, then 0.004 for the next 40 epochs and to 0.0008 for the last 40 epochs; we used the initial learning rate and learning-rate schedule reported in ref. 53. During the 200 epochs of training for each increment, we kept track of the network with the best accuracy on the validation set. To prevent overfitting, at the start of each new increment, we reset the weights of the network to the weights of the best-performing (on the validation set) network found during the previous increment; this is equivalent to early stopping for each different increment.
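    The step schedule described above can be written as a small lookup. This is a sketch: the function names and the convention of counting epochs from zero are assumptions, whereas the boundaries and rates are those given in the text.

```python
def learning_rate(epoch_in_increment):
    """Learning rate for an epoch index 0..199 within one increment."""
    if not 0 <= epoch_in_increment < 200:
        raise ValueError("each increment is trained for 200 epochs")
    if epoch_in_increment < 60:
        return 0.1
    if epoch_in_increment < 120:
        return 0.02
    if epoch_in_increment < 160:
        return 0.004
    return 0.0008

def learning_rate_global(epoch):
    """Same schedule indexed by the global epoch 0..3999; it resets every 200 epochs."""
    return learning_rate(epoch % 200)
```

    For instance, `learning_rate_global(200)` returns to 0.1 because a new increment starts there.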

    We used an 18-layer deep residual network38 for all experiments on class-incremental CIFAR-100. The network architecture is described in detail in Extended Data Table 1. The weights of convolutional and linear layers were initialized using Kaiming initialization54, the weights for the batch-norm layers were initialized to one and all of the bias terms in the network were initialized to zero. Each time five new classes were made available to the network, five more output units were added to the final layer of the network. The weights and biases of these output units were initialized using the same initialization scheme. The weights of the network were optimized using SGD with a momentum of 0.9, a weight decay of 0.0005 and a mini-batch size of 90.

    We used several steps of data preprocessing before the images were presented to the network. First, the value of all the pixels in each image was rescaled between 0 and 1 through division by 255. Then, each pixel in each channel was centred and rescaled by the average and standard deviation of the pixel values of each channel, respectively. Finally, we applied three random data transformations to each image before feeding it to the network: randomly horizontally flip the image with a probability of 0.5, randomly crop the image by padding the image with 4 pixels on each side and randomly cropping to the original size, and randomly rotate the image between 0 and 15°. The first two steps of preprocessing were applied to the training, validation and test sets, but the random transformations were only applied to the images in the training set.
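    The following NumPy sketch illustrates the deterministic normalization and two of the three random transformations (flip and pad-then-crop). The random rotation is omitted to keep the example dependency-free, zero padding is an assumption (the text does not specify the padding value), and the random stand-in batch and function names are hypothetical.

```python
import numpy as np

def normalize(images, mean, std):
    """images: (N, H, W, C) uint8. mean, std: (C,) statistics of the rescaled pixels."""
    return (images.astype(np.float32) / 255.0 - mean) / std

def random_flip_and_crop(image, rng, pad=4):
    """image: (H, W, C). Flip horizontally with probability 0.5, then pad and crop."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]
    h, w, _ = image.shape
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))  # zero padding (assumed)
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w, :]

# Stand-in data with the CIFAR image shape (32 x 32 x 3).
rng = np.random.default_rng(0)
batch = rng.integers(0, 256, size=(8, 32, 32, 3), dtype=np.uint8)
scaled = batch.astype(np.float32) / 255.0
mean, std = scaled.mean(axis=(0, 1, 2)), scaled.std(axis=(0, 1, 2))
x = normalize(batch, mean, std)        # applied to training, validation and test images
aug = random_flip_and_crop(x[0], rng)  # applied to training images only
```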

    We tested several hyperparameters to ensure the best performance for each different algorithm with our specific architecture. For the base system, we tested values for the weight decay parameter in {0.005, 0.0005, 0.00005}. A weight-decay value of 0.0005 resulted in the best performance in terms of area under the curve for accuracy on the test set over the 20 increments. For Shrink and Perturb, we used the weight-decay value of the base system and tested values for the standard deviation of the Gaussian noise in \(\{10^{-4},10^{-5},10^{-6}\}\); \(10^{-5}\) resulted in the best performance. For continual backpropagation, we tested values for the maturity threshold in {1,000, 10,000} and for the replacement rate in \(\{10^{-4},10^{-5},10^{-6}\}\) using the contribution utility described in equation (1). A maturity threshold of 1,000 and a replacement rate of \(10^{-5}\) resulted in the best performance. Finally, for the head-resetting baseline, in Extended Data Fig. 1a, we used the same hyperparameters as for the base system, but the output layer was reinitialized at the start of each increment.

    In Fig. 2d, we plot the stable rank of the representation in the penultimate layer of the network and the percentage of dead units in the full network. For a matrix \({\boldsymbol{\Phi }}\in {{\mathbb{R}}}^{n\times m}\) with singular values \(\sigma_{k}\) sorted in descending order for k = 1, 2,…, q and q = max(n, m), the stable rank55 is \(\min \left\{k:\frac{\sum_{i=1}^{k}\sigma_{i}}{\sum_{j=1}^{q}\sigma_{j}} > 0.99\right\}\).
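    Under this definition, the stable rank is the smallest number of leading singular values whose sum exceeds 99% of the total. A minimal NumPy sketch follows; the function name is illustrative, and note that `numpy.linalg.svd` returns only min(n, m) singular values, which gives the same result because any further ones would be zero.

```python
import numpy as np

def stable_rank(phi, threshold=0.99):
    """Smallest k such that the top-k singular values carry more than `threshold` of the sum."""
    s = np.linalg.svd(phi, compute_uv=False)  # sorted in descending order
    cumulative = np.cumsum(s) / s.sum()
    return int(np.argmax(cumulative > threshold)) + 1

full = stable_rank(np.eye(10))                        # all directions equally important
low = stable_rank(np.outer(np.ones(5), np.ones(4)))   # a rank-one matrix
print(full, low)
```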

    For reference, we also implemented a network with the same hyperparameters as the base system but that was reinitialized at the beginning of each increment. Figure 2b shows the performance of each algorithm relative to the performance of the reinitialized network. For completeness, Extended Data Fig. 1a shows the test accuracy of each algorithm in each different increment. The final accuracy of continual backpropagation on all 100 classes was 76.13%. Extended Data Fig. 1b shows the performance of continual backpropagation for different replacement rates with a maturity threshold of 1,000. For all algorithms that we tested, there was no correlation between when a class was presented and the accuracy of that class, implying that the temporal order of classes did not affect performance.

    Robust loss of plasticity in permuted MNIST

    We now use a computationally cheap problem based on the MNIST dataset56 to test the generality of loss of plasticity across various conditions. MNIST is one of the most common supervised-learning datasets used in deep learning. It consists of 60,000, 28 × 28, greyscale images of handwritten digits from 0 to 9, together with their correct labels. For example, the left image in Extended Data Fig. 3a shows an image that is labelled by the digit 7. The smaller number of classes and the simpler images enable much smaller networks to perform well on this dataset than are needed on ImageNet or CIFAR-100. The smaller networks in turn mean that much less computation is needed to perform the experiments and thus experiments can be performed in greater quantities and under a variety of different conditions, enabling us to perform deeper and more extensive studies of plasticity.

    We created a continual supervised-learning problem using permuted MNIST datasets57,58. An individual permuted MNIST dataset is created by permuting the pixels in the original MNIST dataset. The right image in Extended Data Fig. 3a is an example of such a permuted image. Given a way of permuting, all 60,000 images are permuted in the same way to produce the new permuted MNIST dataset. Furthermore, we normalized pixel values between 0 and 1 by dividing by 255.

    By repeatedly randomly selecting from the approximately \(10^{1930}\) possible permutations, we created a sequence of 800 permuted MNIST datasets and supervised-learning tasks. For each task, we presented each of its 60,000 images one by one in random order to the learning network. Then we moved to the next permuted MNIST task and repeated the whole procedure, and so on for up to 800 tasks. No indication was given to the network at the time of task switching. With the pixels being permuted in a completely unrelated way, we might expect classification performance to fall substantially at the time of each task switch. Nevertheless, across tasks, there could be some savings, some improvement in speed of learning or, alternatively, there could be loss of plasticity—loss of the ability to learn across tasks. The network was trained on a single pass through the data and there were no mini-batches. We call this problem Online Permuted MNIST.
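    A permuted task can be built by drawing one pixel permutation and applying it to every image. The sketch below uses random stand-in data in place of the real MNIST images (loading the dataset is outside its scope), and the function name is an assumption. It also checks the size of the permutation space: 28 × 28 = 784 pixels give 784! orderings.

```python
import math
import numpy as np

def permute_dataset(images, rng):
    """images: (N, 784) uint8, flattened. Returns the normalized permuted copy and the permutation."""
    perm = rng.permutation(images.shape[1])   # one permutation shared by all images
    return images[:, perm].astype(np.float32) / 255.0, perm

rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(100, 784), dtype=np.uint8)  # stand-in for MNIST
permuted, perm = permute_dataset(fake_images, rng)

# log10(784!) via the log-gamma function, since 784! overflows a float.
log10_num_permutations = math.lgamma(784 + 1) / math.log(10)
print(log10_num_permutations)
```

    Because the same permutation is applied to all images, each task is exactly as hard as the original one for a network without spatial priors. Only the mapping from input position to meaning changes.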

    We applied feed-forward neural networks with three hidden layers to Online Permuted MNIST. We did not use convolutional layers, as they could not be helpful on the permuted problem because the spatial information is lost; in MNIST, convolutional layers are often not used even on the standard, non-permuted problem. For each example, the network estimated the probabilities of each of the ten classes, compared them to the correct label and performed SGD on the cross-entropy loss. As a measure of online performance, we recorded the percentage of times the network correctly classified each of the 60,000 images in the task. We plot this per-task performance measure versus task number in Extended Data Fig. 3b. The weights were initialized according to a Kaiming distribution.

    The left panel of Extended Data Fig. 3b shows the progression of online performance across tasks for a network with 2,000 units per layer and various values of the step-size parameter. Note that performance first increased across tasks, then began falling steadily across all subsequent tasks. This drop in performance means that the network is slowly losing the ability to learn from new tasks. This loss of plasticity is consistent with the loss of plasticity that we observed in ImageNet and CIFAR-100.

    Next, we varied the network size. Instead of 2,000 units per layer, we tried 100, 1,000 and 10,000 units per layer. We ran this experiment for only 150 tasks, primarily because the largest network took much longer to run. The performances at good step sizes for each network size are shown in the middle panel of Extended Data Fig. 3b. Loss of plasticity with continued training is most pronounced at the smaller network sizes, but even the largest networks show some loss of plasticity.

    Next, we studied the effect of the rate at which the task changed. Going back to the original network with 2,000-unit layers, instead of changing the permutation after each 60,000 examples, we now changed it after each 10,000, 100,000 or 1 million examples and ran for 48 million examples in total no matter how often the task changed. The examples in these cases were selected randomly with replacement for each task. As a performance measure of the network on a task, we used the percentage correct over all of the images in the task. The progression of performance is shown in the right panel in Extended Data Fig. 3b. Again, performance fell across tasks, even if the change was very infrequent. Altogether, these results show that the phenomenon of loss of plasticity robustly arises in this form of backpropagation. Loss of plasticity happens for a wide range of step sizes, rates of distribution change and for both underparameterized and overparameterized networks.

    Loss of plasticity with different activations in the Slowly-Changing Regression problem

    There remains the issue of the network’s activation function. In our experiments so far, we have used ReLU, the most popular choice at present, but there are several other possibilities. For these experiments, we switched to an even smaller, more idealized problem. Slowly-Changing Regression is a computationally inexpensive problem in which we can run a single experiment on a CPU core in 15 min, allowing us to perform extensive studies. As its name suggests, this problem is a regression problem—meaning that the labels are real numbers, with a squared loss, rather than nominal values with a cross-entropy loss—and the non-stationarity is slow and continual rather than abrupt, as in a switch from one task to another. In Slowly-Changing Regression, we study loss of plasticity for networks with six popular activation functions: sigmoid, tanh, ELU59, leaky ReLU60, ReLU61 and Swish62.

    In Slowly-Changing Regression, the learner receives a sequence of examples. The input for each example is a binary vector of size m + 1. The input has f slow-changing bits, m − f random bits and then one constant bit. The first f bits in the input vector change slowly. After every T examples, one of the first f bits is chosen uniformly at random and its value is flipped. These first f bits remain fixed for the next T examples. The parameter T allows us to control the rate at which the input distribution changes. The next m − f bits are randomly sampled for each example. Last, the (m + 1)th bit is a bias term with a constant value of one.
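    The input stream can be sketched as a generator. The function name and the small parameter values in the usage are illustrative, and drawing the initial values of the slow bits at random is an assumption; the text specifies only how they change.

```python
import numpy as np

def slowly_changing_inputs(m, f, T, n_examples, rng):
    """Yield binary input vectors of size m + 1: f slow bits, m - f random bits, one bias bit."""
    slow = rng.integers(0, 2, size=f)          # assumed random initial state
    for t in range(n_examples):
        if t > 0 and t % T == 0:
            i = rng.integers(f)                # pick one slow bit uniformly at random
            slow[i] = 1 - slow[i]              # and flip it
        fast = rng.integers(0, 2, size=m - f)  # resampled for every example
        yield np.concatenate([slow, fast, [1]])

rng = np.random.default_rng(0)
xs = np.array(list(slowly_changing_inputs(m=6, f=2, T=5, n_examples=20, rng=rng)))
print(xs[:6])
```

    With T = 5, the first two columns stay constant for five rows at a time and differ by exactly one bit between consecutive blocks, whereas the last column is always one.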

    The target output is generated by running the input vector through a neural network, which is set at the start of the experiment and kept fixed. As this network generates the target output and represents the desired solution, we call it the target network. The weights of the target networks are randomly chosen to be +1 or −1. The target network has one hidden layer with the linear threshold unit (LTU) activation. The value of the ith LTU is one if the input is above a threshold \(\theta_{i}\) and zero otherwise. The threshold \(\theta_{i}\) is set to be equal to (m + 1) × β − \(S_{i}\), in which β ∈ [0, 1] and \(S_{i}\) is the number of input weights with negative value63. The details of the input and target function in the Slowly-Changing Regression problem are also described in Extended Data Fig. 2a.
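    A sketch of such a target network is given below. With binary inputs and ±1 weights, the summed input to unit i lies between \(-S_{i}\) and \((m+1)-S_{i}\), so the threshold sits at a fraction β of that range. The linear output unit, the function name and the values m = 20 and β = 0.7 are assumptions for illustration; the 100 hidden units match the target network described in the text.

```python
import numpy as np

def make_target_network(m, n_hidden, beta, rng):
    """Fixed random network with +1/-1 weights and one hidden layer of LTUs."""
    w_in = rng.choice([-1.0, 1.0], size=(m + 1, n_hidden))
    w_out = rng.choice([-1.0, 1.0], size=n_hidden)
    s = (w_in < 0).sum(axis=0)        # S_i: number of negative input weights of unit i
    theta = (m + 1) * beta - s        # threshold of each LTU

    def target(x):
        hidden = (x @ w_in > theta).astype(float)   # 1 if above threshold, else 0
        return float(hidden @ w_out)                # assumed linear output unit

    return target

rng = np.random.default_rng(0)
target = make_target_network(m=20, n_hidden=100, beta=0.7, rng=rng)
x = np.concatenate([rng.integers(0, 2, size=20), [1]])
y = target(x)

# With beta = 1 the threshold equals the largest possible summed input, so no unit fires.
never_fires = make_target_network(m=20, n_hidden=100, beta=1.0, rng=rng)
y_silent = never_fires(x)
```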

    The details of the specific instance of the Slowly-Changing Regression problem we use in this paper and the learning network used to predict its output are listed in Extended Data Table 4. Note that the target network is more complex than the learning network, as the target network is wider, with 100 hidden units, whereas the learner has just five hidden units. Thus, because the input distribution changes every T examples and the target function is more complex than what the learner can represent, there is a need to track the best approximation.

    We applied learning networks with different activation functions to the Slowly-Changing Regression. The learner used the backpropagation algorithm to learn the weights of the network. We used a uniform Kaiming distribution54 to initialize the weights of the learning network. The distribution is U(−b, b) with bound \(b=\mathrm{gain}\times \sqrt{\frac{3}{\mathrm{num\_inputs}}}\), in which gain is chosen such that the magnitude of inputs does not change across layers. For tanh, sigmoid, ReLU and leaky ReLU, the gain is 5/3, 1, \(\sqrt{2}\) and \(\sqrt{2/(1+{\alpha }^{2})}\), respectively, in which α is the slope of the leaky ReLU for negative inputs. For ELU and Swish, we used \(\mathrm{gain}=\sqrt{2}\), as was done in the original papers59,62.
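    The initialization above can be sketched in NumPy as follows. The function names, the dictionary layout and the example layer size are assumptions; the bound and the gains are those stated in the text. Because U(−b, b) has variance b²/3, each weight has variance gain²/num_inputs.

```python
import numpy as np

GAINS = {
    "tanh": 5.0 / 3.0,
    "sigmoid": 1.0,
    "relu": np.sqrt(2.0),
    "elu": np.sqrt(2.0),
    "swish": np.sqrt(2.0),
}

def leaky_relu_gain(alpha):
    """Gain for a leaky ReLU with negative-side slope alpha."""
    return np.sqrt(2.0 / (1.0 + alpha ** 2))

def kaiming_uniform(num_inputs, num_outputs, gain, rng):
    """Sample a (num_inputs, num_outputs) weight matrix from U(-b, b)."""
    b = gain * np.sqrt(3.0 / num_inputs)
    return rng.uniform(-b, b, size=(num_inputs, num_outputs))

rng = np.random.default_rng(0)
w = kaiming_uniform(1000, 50, GAINS["relu"], rng)
print(w.var(), GAINS["relu"] ** 2 / 1000)   # empirical versus theoretical variance
```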

    We ran the experiment on the Slowly-Changing Regression problem for 3 million examples. For each activation and value of step size, we performed 100 independent runs. First, we generated 100 sequences of examples (input–output pairs) for the 100 runs. Then these 100 sequences of examples were used for experiments with all activations and values of the step-size parameter. We used the same sequence of examples to control the randomness in the data stream across activations and step sizes.

    The results of the experiments are shown in Extended Data Fig. 2b. We measured the squared error, that is, the square of the difference between the true target and the prediction made by the learning network. In Extended Data Fig. 2b, the squared error is presented in bins of 40,000 examples. This means that the first data point is the average squared error on the first 40,000 examples, the next is the average squared error on the next 40,000 examples and so on. The shaded region in the figure shows the standard error of the binned error.

    Extended Data Fig. 2b shows that, in Slowly-Changing Regression, after performing well initially, the error increases for all step sizes and activations. For some activations such as ReLU and tanh, loss of plasticity is severe, and the error increases to the level of the linear baseline. Although for other activations such as ELU loss of plasticity is less severe, there is still a notable loss of plasticity. These results mean that loss of plasticity is not resolved by using commonly used activations. The results in this section complement the results in the rest of the article and add to the generality of loss of plasticity in deep learning.

    Understanding loss of plasticity

    We now turn our attention to understanding why backpropagation loses plasticity in continual-learning problems. The only difference in the learner over time is the network weights. In the beginning, the weights were small random numbers, as they were sampled from the initial distribution; however, after learning some tasks, the weights became optimized for the most recent task. Thus, the starting weights for the next task are qualitatively different from those for the first task. As this difference in the weights is the only difference in the learning algorithm over time, the initial weight distribution must have some unique properties that make backpropagation plastic in the beginning. The initial random distribution might have many properties that enable plasticity, such as the diversity of units, non-saturated units, small weight magnitude etc.

    As we now demonstrate, many advantages of the initial distribution are lost concurrently with loss of plasticity. The loss of each of these advantages partially explains the degradation in performance that we have observed. We then provide arguments for how the loss of these advantages could contribute to loss of plasticity and measures that quantify the prevalence of each phenomenon. We provide an in-depth study of the Online Permuted MNIST problem that will serve as motivation for several solution methods that could mitigate loss of plasticity.

    The first noticeable phenomenon that occurs concurrently with the loss of plasticity is the continual increase in the fraction of constant units. When a unit becomes constant, the gradients flowing back from the unit become zero or very close to zero. Zero gradients mean that the weights coming into the unit do not change, which means that this unit loses all of its plasticity. In the case of ReLU activations, this occurs when the output of the activations is zero for all examples of the task; such units are often said to be dead64,65. In the case of the sigmoidal activation functions, this phenomenon occurs when the output of a unit is too close to either of the extreme values of the activation function; such units are often said to be saturated66,67.

    To measure the number of dead units in a network with ReLU activation, we count the number of units with a value of zero for all examples in a random sample of 2,000 images at the beginning of each new task. An analogous measure in the case of sigmoidal activations is the number of units whose output is within ϵ of either of the extreme values of the function for some small positive ϵ (ref. 68). We only focus on ReLU networks in this section.
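    Both measures reduce to a check over the sample axis of an activation matrix. In this sketch the function names, the value ϵ = 0.01, the restriction of the second function to the logistic sigmoid (extreme values 0 and 1) and the synthetic activations are assumptions.

```python
import numpy as np

def dead_unit_fraction(relu_outputs):
    """relu_outputs: (n_examples, n_units). A unit is dead if it outputs zero on every example."""
    return float((relu_outputs == 0).all(axis=0).mean())

def saturated_unit_fraction(sigmoid_outputs, eps=0.01):
    """A unit is saturated if it is within eps of 0 or of 1 on every example."""
    near_extreme = (sigmoid_outputs < eps) | (sigmoid_outputs > 1 - eps)
    return float(near_extreme.all(axis=0).mean())

# Synthetic pre-activations for 2,000 examples and 10 units; the first 3 units are
# pushed far below zero so that their ReLU output is zero for every example.
rng = np.random.default_rng(0)
pre = rng.standard_normal((2000, 10))
pre[:, :3] -= 100.0
dead = dead_unit_fraction(np.maximum(0.0, pre))
saturated = saturated_unit_fraction(1.0 / (1.0 + np.exp(-pre)))
print(dead, saturated)
```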

    In our experiments on the Online Permuted MNIST problem, the deterioration of online performance is accompanied by a large increase in the number of dead units (left panel of Extended Data Fig. 3c). For the step size of 0.01, up to 25% of units die after 800 tasks. In the permuted MNIST problem, in which all inputs are positive because they are normalized between 0 and 1, once a unit in the first layer dies, it stays dead forever. Thus, an increase in dead units directly decreases the total capacity of the network. In the next section, we will see that methods that stop the units from dying can substantially reduce loss of plasticity. This further supports the idea that the increase in dead units is one of the causes of loss of plasticity in backpropagation.

    Another phenomenon that occurs with loss of plasticity is the steady growth of the network’s average weight magnitude. We measure the average magnitude of the weights by adding up their absolute values and dividing by the total number of weights in the network. In the permuted MNIST experiment, the degradation of online classification accuracy of backpropagation observed in Extended Data Fig. 3b is associated with an increase in the average magnitude of the weights (centre panel of Extended Data Fig. 3c). The growth of the magnitude of the weights of the network can represent a problem because large weight magnitudes are often associated with slower learning. The weights of a neural network are directly linked to the condition number of the Hessian matrix in the second-order Taylor approximation of the loss function. The condition number of the Hessian is known to affect the speed of convergence of SGD algorithms (see ref. 69 for an illustration of this phenomenon in convex optimization). Consequently, the growth in the magnitude of the weights could lead to an ill-conditioned Hessian matrix, resulting in a slower convergence.

    The last phenomenon that occurs with the loss of plasticity is the drop in the effective rank of the representation. Similar to the rank of a matrix, which represents the number of linearly independent dimensions, the effective rank takes into consideration how each dimension influences the transformation induced by a matrix70. A high effective rank indicates that most of the dimensions of the matrix contribute similarly to the transformation induced by the matrix. On the other hand, a low effective rank corresponds to most dimensions having no notable effect on the transformation, implying that the information in most of the dimensions is close to being redundant.

    Formally, consider a matrix \(\Phi \in {{\mathbb{R}}}^{n\times m}\) with singular values \(\sigma_{k}\) for k = 1, 2,…, q, and q = max(n, m). Let \(p_{k}=\sigma_{k}/\|\boldsymbol{\sigma}\|_{1}\), in which \(\boldsymbol{\sigma}\) is the vector containing all the singular values and \(\|\cdot\|_{1}\) is the ℓ1 norm. The effective rank of matrix Φ, or erank(Φ), is defined as

    $$\begin{array}{l}\operatorname{erank}({\boldsymbol{\Phi }})\doteq \exp \{H({p}_{1},{p}_{2},\ldots ,{p}_{q})\},\\ \text{in which}\;H({p}_{1},{p}_{2},\ldots ,{p}_{q})=-\sum\limits_{k=1}^{q}{p}_{k}\log ({p}_{k}).\end{array}$$

    (2)

    Note that the effective rank is a continuous measure that ranges between one and the rank of matrix Φ.
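
    As a rough illustration of equation (2), the sketch below computes the effective rank from a list of singular values: it normalizes them by their sum (their ℓ1 norm, as singular values are non-negative), takes the Shannon entropy and exponentiates it. The function name and the choice to pass singular values directly are assumptions made to keep the example dependency-free; in practice the singular values would come from a singular value decomposition of the representation matrix.

```python
import math


def effective_rank(singular_values):
    """exp of the Shannon entropy of the l1-normalized singular values."""
    total = sum(singular_values)        # l1 norm of the singular-value vector
    entropy = 0.0
    for s in singular_values:
        p = s / total
        if p > 0.0:                     # treat 0 * log(0) as 0
            entropy -= p * math.log(p)
    return math.exp(entropy)


# Four equally important dimensions: effective rank is close to 4.
print(effective_rank([1.0, 1.0, 1.0, 1.0]))
# One dominant dimension: effective rank is only slightly above 1.
print(effective_rank([10.0, 0.1, 0.1, 0.1]))
```

    The two calls show the two extremes described in the text: equal singular values give an effective rank near the full rank, whereas a single dominant singular value gives an effective rank near one.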

    In the case of neural networks, the effective rank of a hidden layer measures the number of units that can produce the output of the layer. If a hidden layer has a low effective rank, then a small number of units can produce the output of the layer, meaning that many of the units in the hidden layer are not providing any useful information. We approximate the effective rank on a random sample of 2,000 examples before training on each task.

    In our experiments, loss of plasticity is accompanied by a decrease in the average effective rank of the network (right panel of Extended Data Fig. 3c). This phenomenon in itself is not necessarily a problem. After all, it has been shown that gradient-based optimization seems to favour low-rank solutions through implicit regularization of the loss function or implicit minimization of the rank itself71,72. However, a low-rank solution might be a bad starting point for learning from new observations because most of the hidden units provide little to no information.

    The decrease in effective rank could explain the loss of plasticity in our experiments in the following way. After each task, the learning algorithm finds a low-rank solution for the current task, which then serves as the initialization for the next task. As the process continues, the effective rank of the representation layer keeps decreasing after each task, limiting the number of solutions that the network can represent immediately at the start of each new task.

    In this section, we looked deeper at the networks that lost plasticity in the Online Permuted MNIST problem. We noted that the only difference in the learning algorithm over time is the weights of the network, which means that the initial weight distribution has some properties that allowed the learning algorithm to be plastic in the beginning. And as learning progressed, the weights of the network moved away from the initial distribution and the algorithm started to lose plasticity. We found that loss of plasticity is correlated with an increase in weight magnitude, a decrease in the effective rank of the representation and an increase in the fraction of dead units. Each of these correlates partially explains loss of plasticity faced by backpropagation.

    Existing deep-learning methods for mitigating loss of plasticity

    We now investigate several existing methods and test how they affect loss of plasticity. We study five existing methods: L2 regularization73, Dropout74, online normalization75, Shrink and Perturb11 and Adam43. We chose L2 regularization, Dropout, normalization and Adam because these methods are commonly used in deep-learning practice. Although Shrink and Perturb is not a commonly used method, we chose it because it reduces the failure of pretraining, a problem that is an instance of loss of plasticity. To assess if these methods can mitigate loss of plasticity, we tested them on the Online Permuted MNIST problem using the same network architecture we used in the previous section, ‘Understanding loss of plasticity’. Similar to the previous section, we measure the online classification accuracy on all 60,000 examples of the task. All the algorithms used a step size of 0.003, which was the best-performing step size for backpropagation in the left panel of Extended Data Fig. 3b. We also use the three correlates of loss of plasticity found in the previous section to get a deeper understanding of the performance of these methods.

    An intuitive way to address loss of plasticity is to use weight regularization, as loss of plasticity is associated with a growth of weight magnitudes, shown in the previous section. We used L2 regularization, which adds a penalty to the loss function proportional to the ℓ2 norm of the weights of the network. The L2 regularization penalty incentivizes SGD to find solutions that have a low weight magnitude. This introduces a hyperparameter λ that modulates the contribution of the penalty term.
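
    To make the role of λ concrete, here is a minimal sketch of one SGD step with an L2 penalty. It assumes the common squared-norm form of the penalty, λ·Σw², whose gradient is 2λw; the text does not spell out the exact form used in the experiments, and the function name `sgd_step_with_l2` is hypothetical.

```python
def sgd_step_with_l2(weights, grads, step_size, lam):
    """One SGD step on loss + lam * sum(w**2) (assumed squared-norm penalty)."""
    # The penalty adds 2 * lam * w to each gradient, pulling every weight towards zero.
    return [w - step_size * (g + 2.0 * lam * w) for w, g in zip(weights, grads)]


# With zero task gradient, the penalty alone shrinks each weight by a constant factor.
weights = [1.0, -2.0]
shrunk = sgd_step_with_l2(weights, grads=[0.0, 0.0], step_size=0.1, lam=0.5)
print(shrunk)
```

    In this toy call each weight is multiplied by 1 − 2 × 0.1 × 0.5 = 0.9, which shows why a larger λ keeps the weight magnitude lower.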

    The purple line in the left panel of Extended Data Fig. 4a shows the performance of L2 regularization on the Online Permuted MNIST problem. The purple lines in the other panels of Extended Data Fig. 4a show the evolution of the three correlates of loss of plasticity with L2 regularization. For L2 regularization, the weight magnitude does not continually increase. Moreover, as expected, the non-increasing weight magnitude is associated with lower loss of plasticity. However, L2 regularization does not fully mitigate loss of plasticity. The other two correlates for loss of plasticity explain this, as the percentage of dead units kept increasing and the effective rank kept decreasing. Finally, Extended Data Fig. 4b shows the performance of L2 regularization for different values of λ. The regularization parameter λ controlled the peak of the performance and how quickly it decreased.

    A method related to weight regularization is Shrink and Perturb11. As the name suggests, Shrink and Perturb performs two operations; it shrinks all the weights and then adds random Gaussian noise to these weights. The introduction of noise introduces another hyperparameter, the standard deviation of the noise. Owing to the shrinking part of Shrink and Perturb, the algorithm favours solutions with smaller average weight magnitude than backpropagation. Moreover, the added noise prevents units from dying because it adds a non-zero probability that a dead unit will become active again. If Shrink and Perturb mitigates these correlates to loss of plasticity, it could reduce loss of plasticity.
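
    The two operations can be sketched in a few lines. This is an illustrative version only: the names `shrink_factor` and `noise_std` are assumptions, and the sketch does not specify how often the operation is applied during training.

```python
import random


def shrink_and_perturb(weights, shrink_factor, noise_std, rng):
    """Shrink every weight, then add zero-mean Gaussian noise to it."""
    return [shrink_factor * w + rng.gauss(0.0, noise_std) for w in weights]


rng = random.Random(0)                 # seeded for reproducibility
weights = [2.0, -1.0, 0.0]             # the last weight stands in for a dead unit's input
new_weights = shrink_and_perturb(weights, shrink_factor=0.9, noise_std=0.01, rng=rng)
print(new_weights)
```

    Even a weight that was exactly zero receives a small random value, which is the mechanism by which the added noise gives a dead unit a chance to become active again.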

    The performance of Shrink and Perturb is shown in orange in Extended Data Fig. 4. Similar to L2 regularization, Shrink and Perturb stops the weight magnitude from continually increasing. Moreover, it also reduces the percentage of dead units. Its effective rank is lower than that of backpropagation but still higher than that of L2 regularization. Not only does Shrink and Perturb have a lower loss of plasticity than backpropagation but it almost completely mitigates loss of plasticity in Online Permuted MNIST. However, Shrink and Perturb was sensitive to the standard deviation of the noise. If the noise was too high, loss of plasticity was much more severe, and if it was too low, it did not have any effect.

    An important technique in modern deep learning is called Dropout74. Dropout randomly sets each hidden unit to zero with a small probability, which is a hyperparameter of the algorithm. The performance of Dropout is shown in pink in Extended Data Fig. 4.

    Dropout showed similar measures of percentage of dead units, weight magnitude and effective rank as backpropagation, but, surprisingly, showed higher loss of plasticity. The poor performance of Dropout is not explained by our three correlates of loss of plasticity, which means that there are other possible causes of loss of plasticity. A thorough investigation of Dropout is beyond the scope of this paper, though it would be an interesting direction for future work. We found that a higher Dropout probability corresponded to a faster and sharper drop in performance. Dropout with probability of 0.03 performed the best and its performance was almost identical to that of backpropagation. However, Extended Data Fig. 4a shows the performance for a Dropout probability of 0.1 because it is more representative of the values used in practice.

    Another commonly used technique in deep learning is batch normalization76. In batch normalization, the output of each hidden layer is normalized and rescaled using statistics computed from each mini-batch of data. We decided to include batch normalization in this investigation because it is a popular technique often used in practice. Because batch normalization is not amenable to the online setting used in the Online Permuted MNIST problem, we used online normalization77 instead, an online variant of batch normalization. Online normalization introduces two hyperparameters used for the incremental estimation of the statistics in the normalization steps.

    The performance of online normalization is shown in green in Extended Data Fig. 4. Online normalization had fewer dead units and a higher effective rank than backpropagation in the earlier tasks, but both measures deteriorated over time. In the later tasks, the network trained using online normalization had a higher percentage of dead units and a lower effective rank than the network trained using backpropagation. The online classification accuracy is consistent with these results. Initially, online normalization had better classification accuracy, but later, its classification accuracy became lower than that of backpropagation. For online normalization, the hyperparameters determined when the performance of the method peaked and slightly changed how fast it reached that peak.

    No assessment of alternative methods can be complete without Adam43, as it is considered one of the most useful tools in modern deep learning. The Adam optimizer is a variant of SGD that uses an estimate of the first moment of the gradient scaled inversely by an estimate of the second moment of the gradient to update the weights instead of directly using the gradient. Because of its widespread use and success in both supervised and reinforcement learning, we decided to include Adam in this investigation to see how it would affect the plasticity of deep neural networks. Adam has two hyperparameters that are used for computing the moving averages of the first and second moments of the gradient. We used the default values of these hyperparameters proposed in the original paper and tuned the step-size parameter.

    The performance of Adam is shown in cyan in Extended Data Fig. 4. Adam’s loss of plasticity can be categorized as catastrophic, as its performance drops substantially. Consistent with our previous results, Adam scores poorly in the three measures corresponding to the correlates of loss of plasticity. Adam had an early increase in the percentage of dead units that plateaus at around 60%, a weight magnitude similar to that of backpropagation and a large drop in the effective rank early during training. We also tested Adam with different activation functions on the Slowly-Changing Regression problem and found that loss of plasticity with Adam is usually worse than with SGD.

    Many of the standard methods substantially worsened loss of plasticity. The effect of Adam on the plasticity of the networks was particularly notable. Networks trained with Adam quickly lost almost all of their diversity, as measured by the effective rank, and gained a large percentage of dead units. This marked loss of plasticity of Adam is an important result for deep reinforcement learning, for which Adam is the default optimizer78, and reinforcement learning is inherently continual owing to the ever-changing policy. Similar to Adam, other commonly used methods such as Dropout and normalization worsened loss of plasticity. Normalization had better performance in the beginning, but later it had a sharper drop in performance than backpropagation. In the experiment, Dropout simply made the performance worse. We saw that the higher the Dropout probability, the larger the loss of plasticity. These results mean that some of the most successful tools in deep learning do not work well in continual learning, and we need to focus on directly developing tools for continual learning.

    We did find some success in maintaining plasticity in deep neural networks. L2 regularization and Shrink and Perturb reduce loss of plasticity. Shrink and Perturb is particularly effective, as it almost entirely mitigates loss of plasticity. However, both Shrink and Perturb and L2 regularization are slightly sensitive to hyperparameter values. Both methods only reduce loss of plasticity for a small range of hyperparameters, whereas for other hyperparameter values, they make loss of plasticity worse. This sensitivity to hyperparameters can limit the application of these methods to continual learning. Furthermore, Shrink and Perturb does not fully resolve the three correlates of loss of plasticity: it has a lower effective rank than backpropagation and it still has a high fraction of dead units.

    We also applied continual backpropagation on Online Permuted MNIST. The replacement rate is the main hyperparameter in continual backpropagation, as it controls how rapidly units are reinitialized in the network. For example, a replacement rate of 10⁻⁶ for our network with 2,000 hidden units in each layer would mean replacing one unit in each layer after every 500 examples.
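
    The arithmetic behind this example can be checked with a short calculation: the replacement rate multiplied by the number of units gives the expected number of units replaced per example, and its reciprocal gives the number of examples between replacements. The function name below is hypothetical.

```python
def examples_per_replacement(replacement_rate, units_per_layer):
    """Number of examples seen, on average, before one unit in a layer is replaced."""
    units_replaced_per_example = replacement_rate * units_per_layer
    return 1.0 / units_replaced_per_example


# Values from the text: replacement rate of 1e-6 and 2,000 hidden units per layer.
interval = examples_per_replacement(1e-6, 2000)
print(round(interval))
```

    The result is rounded before printing because the intermediate product is a floating-point value.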

    Blue lines in Extended Data Fig. 4 show the performance of continual backpropagation. It has a non-degrading performance and is stable for a wide range of replacement rates. Continual backpropagation also mitigates all three correlates of loss of plasticity. It has almost no dead units, stops the network weights from growing and maintains a high effective rank across tasks. All algorithms that maintain a low weight magnitude also reduced loss of plasticity. This supports our claim that low weight magnitudes are important for maintaining plasticity. The algorithms that maintain low weight magnitudes were continual backpropagation, L2 regularization and Shrink and Perturb. Shrink and Perturb and continual backpropagation have an extra advantage over L2 regularization: they inject randomness into the network. This injection of randomness leads to a higher effective rank and lower number of dead units, which leads to both of these algorithms performing better than L2 regularization. However, continual backpropagation injects randomness selectively, effectively removing all dead units from the network and leading to a higher effective rank. This smaller number of dead units and a higher effective rank explains the better performance of continual backpropagation.

    Details and further analysis in reinforcement learning

    The experiments presented in the main text were conducted using the Ant-v3 environment from OpenAI Gym79. We changed the coefficient of friction by sampling it log-uniformly from the range [0.02, 2.00], using a logarithm with base 10. The coefficient of friction changed at the first episode boundary after 2 million time steps had passed since the last change. We also tested Shrink and Perturb on this problem and found that it did not provide a marked performance improvement over L2 regularization. Two separate networks were used for the policy and the value function, and both had two hidden layers with 256 units. These networks were trained using Adam alongside PPO to update the weights in the network. See Extended Data Table 5 for the values of the other hyperparameters. In all of the plots showing results of reinforcement-learning experiments, the shaded region represents the 95% bootstrapped confidence interval80.
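
    Log-uniform sampling with a base-10 logarithm can be sketched as follows: draw the exponent uniformly between log10(0.02) and log10(2.00), then raise 10 to that exponent. The function name and the seeded generator are assumptions for illustration; how the sampled value is written into the simulator is not shown.

```python
import math
import random


def sample_friction(rng, low=0.02, high=2.00):
    """Sample a coefficient of friction log-uniformly from [low, high] (base-10 logarithm)."""
    exponent = rng.uniform(math.log10(low), math.log10(high))
    return 10.0 ** exponent


rng = random.Random(0)                          # seeded for reproducibility
samples = [sample_friction(rng) for _ in range(1000)]
print(min(samples), max(samples))
```

    Under this scheme, values between 0.02 and 0.2 are drawn about as often as values between 0.2 and 2.00, so low-friction settings are not crowded out by the much wider upper part of the range.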

    The reward signal in the ant problem consists of four components. The main component rewards the agent for forward movement. It is proportional to the distance moved by the ant in the positive x direction since the last time step. The second component has a value of 1 at each time step. The third component penalizes the ant for taking large actions. This component is proportional to the square of the magnitude of the action. Finally, the last component penalizes the agent for large external contact forces. It is proportional to the sum of external forces (clipped in a range). The reward signal at each time step is the sum of these four components.

    We also evaluated PPO and its variants in two more environments: Hopper-v3 and Walker-v3. The results for these experiments are presented in Extended Data Fig. 5a. The results mirrored those from Ant-v3: the performance of standard PPO degraded substantially. However, this time, L2 regularization did not fix the issue in all cases; there was some performance degradation with L2 in Walker-v3. PPO with both continual backpropagation and L2 regularization completely fixed the issue in all environments. Note that the only difference between our experiments and what is typically done in the literature is that we run the experiments for longer. Typically, these experiments are only done for 3 million steps, but we ran these experiments for up to 100 million steps.

    PPO with L2 regularization only avoided degradation for a relatively large value of weight decay, 10⁻³. This extreme regularization stops the agent from finding better policies, so it stays stuck at a suboptimal policy. There was large performance degradation for smaller values of weight decay, and for larger values, the performance was always low. When we used continual backpropagation and L2 regularization together, we could use smaller values of weight decay. All the results for PPO with continual backpropagation and L2 regularization have a weight decay of 10⁻⁴, a replacement rate of 10⁻⁴ and a maturity threshold of 10⁴. We found that the performance of PPO with continual backpropagation and L2 regularization was sensitive to the replacement rate but not to the maturity threshold and weight decay.

    PPO uses the Adam optimizer, which keeps running estimates of the gradient and the square of the gradient. These estimates require two further parameters, called β1 and β2. The standard values of β1 and β2 are 0.9 and 0.999, respectively, which we refer to as standard Adam. Lyle et al.24 showed that the standard values of β1 and β2 cause a large loss of plasticity. This happens because of the mismatch in β1 and β2. A sudden large gradient can cause a very large update, as a large value of β2 means that the running estimate for the square of the gradient, which is used in the denominator, is updated much more slowly than the running estimate for the gradient, which is the numerator. This loss of plasticity in Adam can be reduced by setting β1 equal to β2. In our experiments, we set β1 and β2 to 0.99 and refer to it as tuned Adam/PPO. In Extended Data Fig. 5c, we measure the largest total weight change in the network during a single update cycle for bins of 1 million steps. The first point in the plots shows the largest weight change in the first 1 million steps. The second point shows the largest weight change in the second 1 million steps and so on. The figure shows that standard Adam consistently causes very large updates to the weights, which can destabilize learning, whereas tuned Adam with β1 = β2 = 0.99 has substantially smaller updates, which leads to more stable learning. In all of our experiments, all algorithms other than the standard PPO used the tuned parameters for Adam (β1 = β2 = 0.99). The failure of standard Adam with PPO is similar to the failure of standard Adam in permuted MNIST.
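
    The effect of the mismatch between β1 and β2 can be illustrated on a single scalar weight. The sketch below runs the standard Adam moment estimates with bias correction on a long run of small gradients followed by one sudden large gradient, and reports the size of the final update per unit of step size. The gradient sequence, the function name and the value of `eps` are assumptions chosen for illustration; this is not the measurement used for Extended Data Fig. 5c.

```python
import math


def adam_update_ratios(grads, beta1, beta2, eps=1e-8):
    """Return |m_hat| / (sqrt(v_hat) + eps) after each gradient (update size per unit step size)."""
    m = 0.0   # running estimate of the gradient (numerator)
    v = 0.0   # running estimate of the squared gradient (denominator)
    ratios = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)      # bias-corrected first moment
        v_hat = v / (1.0 - beta2 ** t)      # bias-corrected second moment
        ratios.append(abs(m_hat) / (math.sqrt(v_hat) + eps))
    return ratios


# Hypothetical gradient stream: 10,000 small gradients, then one sudden large gradient.
grads = [0.01] * 10000 + [10.0]
standard = adam_update_ratios(grads, beta1=0.9, beta2=0.999)[-1]
tuned = adam_update_ratios(grads, beta1=0.99, beta2=0.99)[-1]
print(standard, tuned)
```

    In this toy setting the standard values produce a final update several times larger than the step size, because the numerator reacts to the spike much faster than the denominator. When β1 equals β2, both estimates use the same averaging weights, so the ratio cannot exceed one.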

    In our next experiment, we perform a preliminary comparison with ReDo25. ReDo is another selective reinitialization method that builds on continual backpropagation but uses a different measure of utility and strategy for reinitializing. We tested ReDo on Ant-v3, the hardest of the three environments. ReDo requires two parameters: a threshold and a reinitialization period. We tested ReDo for all combinations of thresholds in {0.01, 0.03, 0.1} and reinitialization periods in {10, 10², 10³, 10⁴, 10⁵}; a threshold of 0.1 with a reinitialization period of 10² performed the best. The performance of PPO with ReDo is plotted in Extended Data Fig. 5b. ReDo and continual backpropagation were used with weight decay of 10⁻⁴ and β1 and β2 of 0.99. The figure shows that PPO with ReDo and L2 regularization performs much better than standard PPO. However, it still suffers from performance degradation and its performance is worse than PPO with L2 regularization. Note that this is only a preliminary comparison; we leave a full comparison and analysis of both methods for future work.

    The performance drop of PPO in stationary environments is a nuanced phenomenon. Loss of plasticity and forgetting are both responsible for the observed degradation in performance. The degradation in performance implies that the agent forgot the good policy it had once learned, whereas the inability of the agent to relearn a good policy means it lost plasticity.

    Loss of plasticity expresses itself in various forms in deep reinforcement learning. Some work found that deep reinforcement learning systems can lose their generalization abilities in the presence of non-stationarities81. A reduction in the effective rank, similar to the rank reduction in CIFAR-100, has been observed in some deep reinforcement-learning algorithms82. Nikishin et al.18 showed that many reinforcement-learning systems perform better if their network is occasionally reset to its naive initial state, retaining only the replay buffer. This is because the learning networks became worse than a reinitialized network at learning from new data. Recent work has improved performance in many reinforcement-learning problems by applying plasticity-preserving methods25,83,84,85,86,87. These works focused on deep reinforcement learning systems that use large replay buffers. Our work complements this line of research as we studied systems based on PPO, which has much smaller replay buffers. Loss of plasticity is most relevant for systems that use small or no replay buffers, as large buffers can hide the effect of new data. Overcoming loss of plasticity is an important step towards deep reinforcement-learning systems that can learn from an online data stream.

    Extended discussion

    There are two main goals in continual learning: maintaining stability and maintaining plasticity88,89,90,91. Maintaining stability is concerned with memorizing useful information and maintaining plasticity is about finding new useful information when the data distribution changes. Current deep-learning methods struggle to maintain stability as they tend to forget previously learned information28,29. Many papers have been dedicated to maintaining stability in deep continual learning30,92,93,94,95,96,97. We focused on continually finding useful information, not on remembering useful information. Our work on loss of plasticity is different but complementary to the work on maintaining stability. Continual backpropagation in its current form does not tackle the forgetting problem. Its current utility measure only considers the importance of units for current data. One idea to tackle forgetting is to use a long-term measure of utility that remembers which units were useful in the past. Developing methods that maintain both stability and plasticity is an important direction for future work.

    There are many desirable properties for an efficient continual-learning system98,99. It should be able to keep learning new things, control what it remembers and forgets, have good computational and memory efficiency and use previous knowledge to speed up learning on new data. The choice of the benchmark affects which property is being focused on. Most benchmarks and evaluations in our paper only focused on plasticity but not on other aspects, such as forgetting and speed of learning. For example, in Continual ImageNet, previous tasks are rarely repeated, which makes it effective for studying plasticity but not forgetting. In permuted MNIST, consecutive tasks are largely independent, which makes it suitable for studying plasticity in isolation. However, this independence means that previous knowledge cannot substantially speed up learning on new tasks. On the other hand, in class-incremental CIFAR-100, previous knowledge can substantially speed up learning of new classes. Overcoming loss of plasticity is an important, but still the first, step towards the goal of fast learning on future data100,101,102. Once we have networks that maintain plasticity, we can develop methods that use previous knowledge to speed up learning on future data.

    Loss of plasticity is a critical factor when learning continues for many tasks, but it might be less important if learning happens for a small number of tasks. Usually, the learning system can take advantage of previous learning in the first few tasks. For example, in class-incremental CIFAR-100 (Fig. 2), the base deep-learning systems performed better than the network trained from scratch for up to 40 classes. This result is consistent with deep-learning applications in which the learning system is first trained on a large dataset and then fine-tuned on a smaller, more relevant dataset. Plasticity-preserving methods such as continual backpropagation may still improve performance in such applications based on fine-tuning, but we do not expect that improvement to be large, as learning happens only for a small number of tasks. We have observed that deep-learning systems gradually lose plasticity, and this effect accumulates over tasks. Loss of plasticity becomes an important factor when learning continues for a large number of tasks; in class-incremental CIFAR-100, the performance of the base deep-learning system was much worse after 100 classes.

    We have made notable progress in understanding loss of plasticity. However, it remains unclear which specific properties of initialization with small random numbers are important for maintaining plasticity. Recent work103,104 has made exciting progress in this direction and it remains an important avenue for future work. The type of loss of plasticity studied in this article is largely because of the loss of the ability to optimize new objectives. This is different from the type of loss of plasticity in which the system can keep optimizing new objectives but loses the ability to generalize11,12. However, it is unclear if the two types of plasticity loss are fundamentally different or if the same mechanism can explain both phenomena. Future work that improves our understanding of plasticity and finds the underlying causes of both types of plasticity loss will be valuable to the community.

    Continual backpropagation uses a utility measure to find and replace low-utility units. One limitation of continual backpropagation is that the utility measure is based on heuristics. Although it performs well, future work on more principled utility measures will improve the foundations of continual backpropagation. Our current utility measure is not a global measure of utility as it does not consider how a given unit affects the overall represented function. One possibility is to develop utility measures in which utility is propagated backwards from the loss function. The idea of utility in continual backpropagation is closely related to connection utility in the neural-network-pruning literature. Various papers105,106,107,108 have proposed different measures of connection utility for the network-pruning problem. Adapting these utility measures to mitigate loss of plasticity is a promising direction for new algorithms and some recent work is already making progress in this direction109.

    The idea of selective reinitialization is similar to the emerging idea of dynamic sparse training110,111,112. In dynamic sparse training, a sparse network is trained from scratch and connections between different units are generated and removed during training. Removing connections requires a measure of utility, and the initialization of new connections requires a generator similar to selective reinitialization. The main difference between dynamic sparse training and continual backpropagation is that dynamic sparse training operates on connections between units, whereas continual backpropagation operates on units. Consequently, the generator in dynamic sparse training must also decide which new connections to grow. Dynamic sparse training has achieved promising results in supervised and reinforcement-learning problems113,114,115, in which dynamic sparse networks achieve performance close to dense networks even at high sparsity levels. Dynamic sparse training is a promising idea that can be useful to maintain plasticity.

    The idea of adding new units to neural networks is present in the continual-learning literature92,116,117. This idea is usually manifested in algorithms that dynamically increase the size of the network. For example, one method117 expands the network by allocating a new subnetwork whenever there is a new task. These methods do not have an upper limit on memory requirements. Although these methods are related to the ideas in continual backpropagation, none are suitable for comparison, as continual backpropagation is designed for learning systems with finite memory, which are well suited for lifelong learning. And these methods would therefore require non-trivial modification to apply to our setting of finite memory.

    Previous works on the importance of initialization have focused on finding the correct weight magnitude to initialize the weights. It has been shown that it is essential to initialize the weights so that the gradients do not become exponentially small in the initial layers of a network and the gradient is preserved across layers54,66. Furthermore, initialization with small weights is critical for sigmoid activations as they may saturate if the weights are too large118. Despite all this work on the importance of initialization, the fact that its benefits are only present initially but not continually has been overlooked, as these papers focused on cases in which learning has to be done just once, not continually.

    Continual backpropagation selectively reinitializes low-utility units. One common strategy to deal with non-stationary data streams is reinitializing the network entirely. In the Online Permuted MNIST experiment, full reinitialization corresponds to a performance that stays at the level of the first point (Extended Data Fig. 4a). In this case, continual backpropagation outperforms full reinitialization as it takes advantage of what it has previously learned to speed up learning on new data. In ImageNet experiments, the final performance of continual backpropagation is only slightly better than a fully reinitialized network (the first point for backpropagation in left panel of Fig. 1b). However, Fig. 1 does not show how fast an algorithm reaches the final performance in each task. We observed that continual backpropagation achieves the best accuracy ten times faster than a fully reinitialized network on the 5,000th task of Continual ImageNet, ten epochs versus about 125 epochs. Furthermore, continual backpropagation could be combined with other methods that mitigate forgetting, which can further speed up learning on new data. In reinforcement learning, full reinitialization is only practical for systems with a large buffer. For systems that keep a small or no buffer, such as those we studied, full reinitialization will lead the agent to forget everything it has learned, and its performance will be down to the starting point.

    Loss of plasticity might also be connected to the lottery ticket hypothesis119. The hypothesis states that randomly initialized networks contain subnetworks that can achieve performance close to that of the original network with a similar number of updates. These subnetworks are called winning tickets. We found that, in continual-learning problems, the effective rank of the representation at the beginning of tasks reduces over time. In a sense, the network obtained after training on several tasks has less randomness and diversity than the original random network. The reduced randomness might mean that the network has fewer winning tickets. And this reduced number of winning tickets might explain loss of plasticity. Our understanding of loss of plasticity could be deepened by fully exploring its connection with the lottery ticket hypothesis.

    Some recent works have focused on quickly adapting to the changes in the data stream120,121,122. However, the problem settings in these papers were offline as they had two separate phases, one for learning and the other for evaluation. To use these methods online, they have to be pretrained on tasks that represent tasks that the learner will encounter during the online evaluation phase. This requirement of having access to representative tasks in the pretraining phase is not realistic for lifelong learning systems as the real world is non-stationary, and even the distribution of tasks can change over time. These methods are not comparable with those we studied in our work, as we studied fully online methods that do not require pretraining.

    In this work, we found that methods that continually injected randomness while maintaining small weight magnitudes greatly reduced loss of plasticity. Many works have found that adding noise while training neural networks can improve training and testing performance. The main benefits of adding noise have been reported to be avoiding overfitting and improving training performance123,124,125. However, it can be tricky to inject noise without degrading performance in some cases126. In our case, when the data distribution is non-stationary, we found that continually injecting noise along with L2 regularization helps with maintaining plasticity in neural networks.
