Why attribution reports cannot go to third-parties and to anything else than the registrable domain #57
Comments
Ping @csharrison and @erik-anderson. |
In the Click Through Conversion Measurement Event-Level API, we address this issue by requiring the destination site to trigger attribution using the same origin that reports will be sent to. Without 3P cookies, in order to properly trigger the attribution, you would need to fire an attribution redirect for every possible identifying origin which will be sent reports. On larger sites this becomes impossible due to the number of potential origins, and browsers can actively limit sites triggering a large number of attribution redirects in an attempt to find the correct identifying origin. This is covered in this section of the Event-level explainer. A different motivating example for the use of third-party reporting origins (or domains): Ads are served on Is this a valid use-case for PCM? With the reporting_origin approach in the Event-level explainer, |
The same goes for PCM but that only partially mitigates the problem. The click source can iterate through a set of domains on the attribution side. That won't get them to full user IDs on its own but only iterating through 64 different domains adds another 6 bits of entropy to the resulting report which allows them to categorize their user base into 64 buckets. That foils the purpose of limiting bits of entropy in source ID and trigger data.
It'll eventually get to the modern JS API way of signaling a conversion. With that, you can imagine wildcard conversion. The tracking pixel redirect was never intended as a scalable, good solution for the future. It's a legacy support measure. Obviously, supporting wildcard conversion signaling puts further pressure on the reporting URLs because now there is no scalability issue.
|
I'm very sympathetic to these concerns, but I'm also a bit worried that this might lead to a situation where setting up an effective reporting structure including a third-party ad tech provider is technically challenging for developers and thus makes it harder for smaller parties (on all sides) to effectively run their business. I also think that there's actually a chance for additional transparency to the user when the reporting origin is clearly revealed in browser UI instead of hidden behind a server-side redirect by the click source. Did you consider letting the ad click destination list all possible reporting domains in some .well-known location in advance instead? That feels like it would resolve the issue of custom eTLDs without sacrificing too much flexibility for developers. |
The reports go to a very specific location. I imagine there will be services offered to listen to incoming requests on that endpoint, parse the data, and communicate it according to how the business is set up. By always sending data to first parties, we align with user expectations and there cannot be any doubt in who is in control of that data.
I don't follow this. Where would this browser UI be and what would it show?
First of all, there would have to be a limit on the size of that list and I'm not sure we can come up with a tradeoff between usefulness and opportunity for misuse. Second, there is no way for browsers to know that all users are being served the same list when calling that .well-known location for a particular site. The list could change based on incoming cookies if such are sent or on network properties if not. Finally, there is no way for a browser to know if e.g. 16 domains on that list are 16 distinct and legitimate reporting domains or if they are 16 domains owned by the same tracking company, allowing that company to categorize users into 16 buckets, effectively granting them 4 extra bits of entropy. |
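The bucket arithmetic used in the comments above (64 domains giving 6 bits, 16 domains giving 4 bits) is just the base-2 logarithm of the number of distinguishable reporting domains. A minimal sketch, where the function name `extra_bits` is made up for illustration:

```python
import math


def extra_bits(num_reporting_domains: int) -> float:
    """Bits of information leaked by *which* of N domains receives a report."""
    if num_reporting_domains < 1:
        raise ValueError("need at least one reporting domain")
    return math.log2(num_reporting_domains)


print(extra_bits(64))  # the 64-domain case from the thread
print(extra_bits(16))  # the 16-entry .well-known list case
print(extra_bits(1))   # a single fixed endpoint leaks nothing extra
```

This only counts the worst case in which every domain is controlled by one party and chosen per user; it says nothing about how likely that is.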
I think this is a great idea.
I wouldn't jump to such a conclusion too soon. I wouldn't be surprised if many advertisers only have a very small number of 3rd parties they want looped into their measurements.
I believe there are technical enforcements to make sure this is the case. A few quick ideas:
I am sure we could come up with other ideas in this space.
Agree this should be in our threat model. |
Oh, I was referring to the (very hypothetical) UI we thought up in #54 and related discussions. The user could have a way to see pending attributions with the tuple (click source, ad source, reporting origin) instead of just (click source, ad source) which might give more transparency over who will end up handling the data.
As @csharrison said keeping this list very small (4 slots?) might already be enough, we could still allow reporting to the click origin to ensure that we don't disadvantage models that don't centralize data collection through an ad tech service (if those exist).
What's the difference between that and the final attribution request, meaning couldn't the same measures be applied to both requests, e.g. not sending cookies and delaying the request by 24h? (Without the other measures Charlie described) the advertiser may encode additional data based on IP address but won't the receiver observe the same address then, making additional identifiers unnecessary?
I agree that enumerating reporting origins is another potential source of bits for a dedicated attacker, by grouping users into n buckets using custom eTLDs. Leaving this unchecked is essentially destroying all privacy guarantees the spec otherwise tries to enforce. Hence I'm trying to suggest a way to at least control n so that we can reason about how much entropy we're adding in the worst case and find a compromise based on that. |
Although I'm in favor of the idea of transparency UI, I don't think it'll serve as a meaningful defense against misuse, especially not for the majority of users. Users know about first party websites, that's about it.
There is always a risk of creating barriers to entry with such small limitations. Who do you think will get to be among the 4? 😕 Additionally, PCM has received positive feedback for its support of multiple sources of attribution, including for a single conversion.
IP address tracking is a separate thing that doesn't improve or worsen any of this. It will have to be dealt with separately. We shouldn't use IP address tracking as an argument as to why we don't need other protections. PCM is designed to be privacy-preserving on the web platform level.
There is zero room for additional bits of entropy. If there was any additional room, it should be spent on the specified data values sourceID and triggerData. If we start saying we can allow a few more bits, we've lost the privacy-preserving properties of PCM which is one of the reasons for doing it at all. |
I think that Charlie's technical enforcement ideas have merit. Each idea adds some level of additional complexity for the browser vendor, either with standing up a service to cross-check what different clients are seeing (and reasoning through the privacy implications of that service) and/or having additional networking requests from different contexts. @johnwilander do you have any thoughts on the viability of those mitigations? |
I think they add too much complexity, not just for browser vendors but also for developers. Complexity often translates to barrier to entry which can lead to only large, well-funded adtech vendors being able to set everything up. Just to drill into potential complexities, let's say the user clicks an ad on Day 1, converts on Day 4, and the report is supposed to be sent to ThirdParty.example on Day 6. At which day should the browser check the validity of the ThirdParty.example endpoint for this advertiser? If it's not Day 6, things may have changed and there's no way for someone inspecting the advertiser's website on Day 6 to see that reports are allowed to go to ThirdParty.example. If it is Day 6, then we have to support some kind of time stamping of report endpoints so that the advertiser's website can state that "Conversions between these timestamps can go to ThirdParty.example but nowadays I don't use ThirdParty.example anymore because I've switched to OtherThirdParty.example." In addition, I think too much time and effort is spent on trying to cater for the old ways of doing things. PCM is not trying to be a drop-in replacement for how things worked in the world of cross-site tracking and tons of third-parties collecting and sending data. This is about a new world where it's reasonably clear to users, developers, and advertisers where data is sent. If they want to share that data with a business partner, it's on them to make that clear to their users and also to live up to legal requirements for data sharing in the current jurisdiction. |
@johnwilander I'm having a hard time seeing your argument that they "add too much complexity[...] for developers." For a typical website that wants to use a third-party reporting endpoint, their marginal setup effort could be as low as adding a static text file at a .well-known location. This is orders of magnitude simpler than setting up any server-side proxying approach or splitting off a CNAME'd subdomain. Sure, there is some complexity for browser vendors, which is appropriate; that's our job. But none of these potential complexities seem particularly challenging. |
For a simple, benign, "happy path" case, it might work functionally. That would be advertiser.example choosing adtech.example as its one and only reporting endpoint forever. But unless we allow them to change the endpoint, we will truly create a barrier to entry. So some kind of managed change needs to be supported. Allowing multiple reporting endpoints could allow for flexibility in a purely additive way as long as the advertiser doesn't reach the limit and perhaps cover most cases. But as mentioned above, a set of reporting endpoints can be doctored to convey cross-site data. Let's then assume that we restrict it to a single reporting endpoint that can be changed. If we don't add further restrictions, advertiser.example can cycle through reporting domains time01.example, time02.example, …, time24.example by the hour and that way encode when in time the conversion happened. I.e. cross-site data leakage. If we say the change needs to be mirrored in time on the click side, you now have a sync issue where the advertiser needs to tell all publishers that when the clock strikes twelve, all must change some file on their server. And even that would be susceptible to gaming with synchronized changes every eight hours (asiaTracking.example, emeaTracking.example, and americaTracking.example) or every three hours (morningTracking.example, middayTracking.example etc). To solve for that we'd need browsers checking multiple times at random which will create a risk of data loss during changes to the reporting endpoint or create a barrier to entry because no one ever wants to change their reporting endpoint. |
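To make the hourly rotation concern concrete, here is a rough sketch of how a single, changeable reporting endpoint could encode conversion time. The domain pattern follows the time01.example … time24.example example above; the function names are hypothetical and this is not part of any proposal.

```python
from datetime import datetime, timezone


def rotating_endpoint(conversion_time: datetime) -> str:
    """Misuse sketch: choose the reporting domain from the hour of the conversion."""
    # Hours 0..23 map to time01.example .. time24.example.
    return f"time{conversion_time.hour + 1:02d}.example"


def decode_hour(endpoint: str) -> int:
    """What the receiver can read back from the domain name alone."""
    number = endpoint[len("time"):endpoint.index(".")]
    return int(number) - 1


when = datetime(2021, 1, 14, 23, 5, tzinfo=timezone.utc)
endpoint = rotating_endpoint(when)
print(endpoint, decode_hour(endpoint))
```

The report body never carries the timestamp; the choice of domain does, which is why the comment treats an unrestricted changeable endpoint as cross-site data leakage.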
It seems to me that checking the desired reporting endpoint at report-send time avoids all the timing attacks. It does introduce some complexity if you go with an approach that checks for consistency across contexts, although there are technical solutions beyond that approach. Even in that case though I don't think it's an unreasonable amount of complexity. When you want to add another reporting endpoint you can "pre-declare" it in your configuration and you include a date when you want it in effect. |
I agree that a single reporting endpoint grabbed at the time of reporting is the safest from a cross-site tracking perspective, simply because it is equivalent to the report going to either the click source or the advertiser, and them either forwarding that info or redirecting the request (if allowed). What remains then is the user perspective and transparency. The user perspective. Users don't know about third-parties and we are sending data about their activity. Users won't expect their user agent to send that data to a third-party and it'll be hard to explain to them why. Transparency. There will be no way to tell the user where their data will go before the time of reporting. This means there will be no way for the user to inspect a website at the time of ad click or time of conversion to see where data about their activity will ultimately go. At a random time 24-48 hours later, a report might be sent to a third-party domain they have never seen and will never see. |
I think it's a bit simplistic to think that just because the user sees a URL then automatically it will go to just the party owning the URL. Millions of businesses rely on Shopify for putting up their shopping cart solution. In this case it's really Shopify getting the data, will the user not be surprised by it? Shopify then shares this data server side with all of the apps in the app exchange. There are many other equivalent platforms that help you build sites and collect data and share it around. |
I'm not saying users will understand that data will go to the first party. I'm just saying they will be almost guaranteed to not understand that the data goes to a third-party straight from their browser. I think our best chance of getting users' buy-in on measurement of online advertising is making it as easy as possible to explain to them what's happening and align the measurement practice with their mental model of browsing the web. |
Every time a browser renders a web page, it offers a channel by which the first party triggers arbitrary requests to third parties.
Sure, but it seems reasonable to communicate this appropriately. "This report will be sent in approximately 17 hours, to shoes.example, or to a 3rd-party service who collects data on their behalf. (Shoes.example currently uses nifty-analytics.example for this.)" |
There might be a middle ground between the extremes of same site only and allowing reports to go to arbitrary URLs. Could the browser limit the URLs that can receive reports to a set of developer-friendly but tracking-unfriendly patterns based on the first party URL? So a click when the first party is example.com could send reports to
but not to 487f90aa469c6234.customTLD? |
Just to repeat the current state of our alternative suggestion as I see it (mixing ideas from different comments here): Walking through a regular click to conversion event:
(the only request in this list that happens in a non-isolated fashion, sending cookies etc. is the top-level navigation to the destination) Now a practical issue for websites and adtech is of course changing their infrastructure between the time when conversion reports are sent out. While we shouldn't completely ignore that issue in practice I don't think it should drive our decision-making here. The other thing that we haven't fully resolved from a privacy perspective is hard-coded "bucketizing" on the side of the ad-tech company to add (a low number of) additional bits. I think that this is controllable both by strongly limiting the number of allowed reporting endpoints as well as simply enforcing regulatory action against advertisers, ad sources and ad tech. All of them have a high stake in not getting their domain denylisted while browsers are at little risk in blocking (or normalizing) conversion requests from bad players. So I think/hope this can cover all the (very valid) concerns from @johnwilander (except more complexity at the browser side which I think is a fair price to pay for us). Specifically I hope that the UI pieces make it clear how I would see this being presented/explained to users. Please let me know if I got anything wrong. :) |
@dmarti What about 487f90aa469c6234.customTLD/privateclick/example.com in that scenario? |
@johannhof The browser would have to match some known TLDs based on their registration policies and costs. If I understand the bucketizing problem, it's tracking companies registering say 64 domains and encoding a unique ID based on which domains the click is reported to. Agree that the number of endpoints should be limited in the browser. Possibly shuffle the list of endpoints for a click, always report to the first n endpoints on the shuffled list, then start randomly dropping the extras. |
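One possible reading of the shuffle-and-drop idea, as a sketch only: the comment does not say how extras are dropped, so the `extra_keep_probability` parameter, the function name, and the default values are all assumptions.

```python
import random


def pick_report_endpoints(endpoints, always_n, extra_keep_probability=0.5, rng=None):
    """Shuffle declared endpoints, always keep the first n, randomly drop the rest."""
    rng = rng if rng is not None else random.Random()
    # Deduplicate while preserving order so a repeated domain gains nothing.
    shuffled = list(dict.fromkeys(endpoints))
    rng.shuffle(shuffled)
    guaranteed = shuffled[:always_n]
    # Each extra endpoint survives independently with the given probability.
    extras = [e for e in shuffled[always_n:] if rng.random() < extra_keep_probability]
    return guaranteed + extras


declared = [f"r{i}.example" for i in range(8)]
print(pick_report_endpoints(declared, always_n=2))
```

Because the shuffle is per click, a tracker cannot rely on any particular subset of its domains receiving the report, which adds noise to a domain-subset encoding.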
Recapping the Jan 14 conversation on this issue: We asked industry representatives for their opinion on:
There was generally a very positive sentiment towards the newly proposed model of flexible reporting endpoints via declaration in .well-known, saying that it would be hard for smaller parties to adopt this standard otherwise. There was a sentiment that smaller publishers may find forwarding technically very challenging, though at least one larger vendor said that they could probably make things work in a forwarding model. Multiple parties also brought up the concern that forwarding could lead to lock-in effects where it's hard for publishers to switch providers vs. just returning a different reporting endpoint. A lot of folks also supported the idea of enabling a multi-endpoint model, again noting that it would otherwise be hard for smaller sites and marketers to pick this up. I guess it's up for discussion to what extent we can enable this while keeping potential "bucketizing" in check. |
In the Jan 14th conversation I alluded to optionally using some of the publisher-side bits to choose among multiple reporters. This would address the issue of lock-in with a single reporter. If multiple reporters were allowed, this avoids having to send a single party's report to all of the other reporters. The concern was that this would allow personalized endpoints due to having access to the 1P cookie when the anchor tag is configured. I think if we combined this idea with the approach in this comment, there would be no risk of personalization. Concretely: Take two (as an example) of the publisher-side bits and encode them as a The checks at click + report time prevent any form of personalization from being effective. |
I think it's a good idea. This will provide advertisers the opportunity to use unbiased solutions to measure their attribution without depending on the publisher to report the data directly to them. It will also reduce development from the publishers. |
Could we imagine a registry of allowed third-party domains? In your example, if an ad network declares in the ad to custom report on That will prevent any use of domains for cross-site tracking. |
As long as there's only one domain to report the attribution to, it prevents cross-site tracking. We can't expect every publisher and every advertiser to develop their own systems to deal with this process, and it's likely they will prefer to use a 3rd party to do that. The current proposal prevents that. |
I sympathize with the issue, and think we should find a solution to that, but the proposed setup isn't free from problems. The current proposal where data doesn't flow to the third party directly basically incentivizes server side integrations like Facebook Conversions API in which significant amounts of PII data can be exchanged with minimal to no ability to enforce any sort of compliance from a regulator or a browser. If we're looking to make the web more privacy preserving we should consider also the consequences of certain decisions. I don't think in good faith anyone could consider this way of sharing data an improvement over the existing way. |
Maybe it would be easier to resolve this issue if it was separated in 2 issues?
For smaller publishers, using a third party platform (and thus domain) to help them monetize their inventory is often more economically viable than having their own IT and salesforce. They need to publish content, work on SEO, find advertising budgets and a lot more. |
The iCloud IP relay service is subscription only. It's a great service and it would be nice if it was for everybody, but it's not, so it does not solve this privacy issue. If it was available, and the browser could detect it, then maybe this could work in those instances. |
Actually, I think it is available in the browser for free, and I think I read that PCM will send the reports through it anyway. |
Can you post a link? |
Here you go: https://developer.apple.com/documentation/safari-release-notes/safari-15-beta-release-notes (for both the PCM reports and hiding IP from trackers, there's no note about Private Relay being needed). |
This is great, it should go into the spec i.e. if browsers use IP hiding reports can go to arbitrary destinations (as long as the urls are entropy limited - e.g. no subdomains and fixed /.well-known path). |
One third-party reporting domain per impression would immediately defeat the privacy protections, just as I explained above: "They could simply put a unique event ID in the domain, such as 487f90aa469c6234.customTLD and be back to web scale event-level cross-site tracking." For this to work, it would have to be one third-party per website and stable over time. The way we'd have to do it is check the endpoint twice – once at click time, and once at attribution report time, to make sure it stays the same. That's the design for checking the public key of the optional click fraud token. I believe such a design was briefly discussed above or in a related issue. IP protection is indeed what is needed here. It enables click fraud tokens too. We'd likely have to write the spec so that only browsers with IP address protection should allow click fraud tokens and third-party reporting domains. |
Note that such a spec update would say that IP address protection is only needed for these specific parts of PCM, not in general. I.e. it wouldn't make PCM impossible to support without paying browser customers or a bunch of other funding. |
Publishers and advertisers, especially small and medium businesses, often put multiple ad tech vendors in competition to get the most revenue. |
Maybe the IP relay ingress server could count the number of subdomains in third-party domains, and refuse to forward them after the list goes > 10. There would have to be a side channel so it can recognise PCM reports, see the target URL, etc. |
Ingress doesn't know the URL of the site being requested, but I had similar ideas. |
Do you mean to send the attributions to the same set of third-parties every time? Because if sites were allowed to pick and choose per impression or conversion, they could do so to boost the number of bits that get through. One thing to be aware of is that as soon as the configured third-party domain(s) change(s), all pending clicks and attributions will be deleted since the browser will not be able to tell if the changed configuration is an attempt to boost the number of bits that get through. I would imagine that with a possible larger set of third-party domains, the likelihood of desired changes will be larger too. |
Think of it this way: Everything that is site configuration has to be static for reports to be sent. If anything changes, the browser will bail out and delete all matching data. All dynamism has to go into the source ID and the trigger data. Those values are specified to exactly control how much data can get through in a report. If anything else is allowed to be dynamic, that can be made to carry more bits of data which will 1) violate the intended privacy guarantees of the feature, and 2) lock those extra bits to that particular dynamism and it's always better to just boost source ID and trigger data if we believe we can protect privacy anyway. Better because source ID and trigger data are easier to reason about, easier to explain, and provide maximum flexibility to developers. |
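The "static configuration or bail out" rule, together with the check at click time and again at report time, could look roughly like the sketch below. The record fields and the `fetch_endpoint` callable are hypothetical stand-ins for however a browser would store pending attributions and re-read a site's declared endpoint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PendingAttribution:
    configuring_site: str   # site whose configuration names the endpoint
    source_id: int
    trigger_data: int
    endpoint_at_click: str  # endpoint observed when the click was stored


def reports_to_send(pending, fetch_endpoint):
    """Keep only attributions whose endpoint is unchanged since click time."""
    current = {}  # one re-check per site at report time
    sendable = []
    for attribution in pending:
        site = attribution.configuring_site
        if site not in current:
            current[site] = fetch_endpoint(site)
        if current[site] == attribution.endpoint_at_click:
            sendable.append((attribution.endpoint_at_click, attribution))
        # Otherwise the configuration changed: drop the data, send nothing.
    return sendable


pending = [
    PendingAttribution("shop.example", 3, 1, "adtech.example"),
    PendingAttribution("news.example", 7, 2, "adtech.example"),
]
declared_now = {"shop.example": "adtech.example", "news.example": "other.example"}
print(reports_to_send(pending, declared_now.get))
```

All dynamism stays in `source_id` and `trigger_data`; the endpoint only ever acts as a yes/no gate on whether the report is sent at all.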
Does that work if https://www.tracking.adtech.com is the only valid and static tracking FQDN to record conversions for adtech.com on the internet? Let's summarize:
|
How do you stop each browser instance getting a different value for attributionreporting, or how does one browser instance detect that? |
I would love to hear what the privacy problem could be in what I am proposing. If attributionreporting gets the value https://www.tracking.adtech.com/ and then https://www.tracking2.adtech.com/, we would invalidate all data for https://www.tracking.adtech.com/ as there can only be one FQDN registered for *.adtech.com in an instance. |
How would you guarantee that there’s only one registration? I’ve probably missed that part. How would you defend against a party who owns a whole TLD, like .adtech? They can register as many domains as they want for free. This is mentioned above. |
I see, thanks John. Could we consider that the ad-tech player defines the tracking domain in a new file /.well-known/tracking-url . |
.well-known/tracking-url should not work on the subdomain. For example https://www.tracking.adtech.com/.well-known/tracking-url or https://www.tracking2.adtech.com/.well-known/tracking-url would check https://www.adtech.com/.well-known/tracking-url . The approach would be similar to what is already done in the industry to fight fraud with ads.txt ( https://iabtechlab.com/ads-txt/ ) in which publisher register valid vendors with their id and tokens. ( https://www.nytimes.com/ads.txt , https://www.bbc.com/ads.txt etc... ) |
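A small sketch of the lookup rule proposed here: whatever subdomain a report targets, the browser would consult the file on the registrable domain. The `PUBLIC_SUFFIXES` set is a deliberately tiny stand-in for the Public Suffix List, the `www.` prefix simply mirrors the example URLs in the comment, and both function names are hypothetical.

```python
from urllib.parse import urlsplit

# Assumption: a toy subset of the Public Suffix List, for illustration only.
PUBLIC_SUFFIXES = {"com", "example", "co.uk"}


def registrable_domain(host: str) -> str:
    """Return the eTLD+1 of a host, using the toy suffix set above."""
    labels = host.lower().rstrip(".").split(".")
    # Scan from the longest candidate suffix to the shortest.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            return ".".join(labels[i - 1:])
    raise ValueError(f"no registrable domain found for {host!r}")


def tracking_url_config_location(report_url: str) -> str:
    """Where the browser would look, regardless of the subdomain requested."""
    host = urlsplit(report_url).hostname
    return f"https://www.{registrable_domain(host)}/.well-known/tracking-url"


print(tracking_url_config_location("https://www.tracking2.adtech.com/.well-known/tracking-url"))
```

Note that this only pins the lookup to one file per registrable domain; it does not by itself address the earlier point about a party that owns a whole TLD and can register many registrable domains.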
Hi! |
Is there any updated plan to support this? As it is, the restriction will give a big advantage to Google and Facebook for the ads they are serving from their sites and will probably significantly impact the revenue of small and medium-sized publishers. |
Hi 👋 |
Now that in iOS 15.1 traffic to trackers identified by the ITP algorithm fully uses Private Relay, I am not sure how a third party can forge the identity of anyone in Safari. I think that Private Relay is a major innovation from Apple 👏 👏 in respecting the privacy of internet users and I am a big fan of it. I understand that it will require funding for other browsers, but for Safari I do not see how it is possible unless a user explicitly wants to be identified by trackers. |
Hi team, |
Two requests were brought up at a recent Privacy CG call and I said I'd write up the privacy analysis of why we think attribution reports cannot go to third-parties and to anything else than the registrable domain (eTLD+1).
Why Not Attribution Reports To Third Parties?
Some have requested that the click source site should be able to assign a reporting URL/domain other than its own. Others have requested that a third-party such as the host of an iframe where the click happens should be the one receiving the report.
Neither of these meets our privacy requirements. In both cases, the domains can be chosen to convey further information about the click.
Imagine for instance social.example where the ad click happens saying they want reports to go to johnwilander-social.example when I'm logged in there and to janedoe-social.example when Jane Doe is logged in. That would take us back to cross-site tracking in the subsequent report.
Similarly, ad links can be made to be served in iframes from johnwilander-social.example or janedoe-social.example to achieve the same level of cross-site tracking.
Even Worse With Custom eTLDs
This issue becomes worse with tracking companies owning their own eTLDs under which it's virtually free for them to register new domains. They could simply put a unique event ID in the domain, such as 487f90aa469c6234.customTLD and be back to web scale event-level cross-site tracking.
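To see how much a domain label like the one above can carry, here is a short sketch that reads the example host back as an event ID. The function name is made up; the host string is the one from the paragraph.

```python
def event_id_from_report_host(host: str) -> int:
    """Recover the hex-encoded event ID from the leftmost label of the host."""
    label = host.split(".")[0]
    return int(label, 16)


host = "487f90aa469c6234.customTLD"
event_id = event_id_from_report_host(host)
bits = len(host.split(".")[0]) * 4  # each hex digit carries 4 bits

print(hex(event_id), bits)
```

Sixteen hex digits is 64 bits, far beyond the limits placed on source ID and trigger data, which is why an unrestricted reporting domain undoes those limits.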
Why Not Attribution Reports To Subdomains?
Some have requested that attribution reports be sent to the full domain of the site where the click happens and similarly the full domain of the site where the conversion happens.
Neither of these meets our privacy requirements. In both cases, subdomains can be chosen to convey further information about the click or conversion.
Imagine for instance social.example where the ad click happens making sure the site is loaded from the subdomain johnwilander.social.example when I'm logged in there and from the subdomain janedoe.social.example when Jane Doe is logged in. That would take us back to cross-site tracking in the subsequent report.
The reason for restricting PCM reports to registrable domains is that the scheme+registrable domain, a.k.a. schemeful site, is the only part of a URL that is free from link decoration. All other parts can be made user specific, including subdomains.
You could of course imagine social.example setting up a registrable domain per user, such as johnwilander-social.example, and load the whole website from that domain when I'm logged in to get back to cross-site tracking of clicks. If that happens, we'd have to deal with it but at least the user has a chance to see that a personalized domain is used through the URL bar.
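The reduction to a schemeful site described above can be sketched as follows. This is an illustration under assumptions: `PUBLIC_SUFFIXES` is a toy stand-in for the Public Suffix List, and `schemeful_site` is a hypothetical helper name, not an API from the PCM spec.

```python
from urllib.parse import urlsplit

# Assumption: a toy subset of the Public Suffix List, for illustration only.
PUBLIC_SUFFIXES = {"example", "com"}


def schemeful_site(url: str) -> str:
    """Strip a URL down to scheme + registrable domain (eTLD+1)."""
    parts = urlsplit(url)
    labels = parts.hostname.lower().split(".")
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            return f"{parts.scheme}://{'.'.join(labels[i - 1:])}"
    raise ValueError(f"no registrable domain found for {url!r}")


# Per-user subdomains, paths, and query strings all collapse to the same site.
print(schemeful_site("https://johnwilander.social.example/ad?click=123"))
print(schemeful_site("https://janedoe.social.example/ad?click=456"))
# A per-user registrable domain survives, but it is visible in the URL bar.
print(schemeful_site("https://johnwilander-social.example/"))
```

The first two URLs yield the same reporting destination, so the subdomain carries nothing into the report; the third shows the residual case the last paragraph acknowledges.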