Python: remove the imprecise container taint steps #17030

yoff · 2024-07-22T14:30:36Z

We used to have taint steps from any element of a collection to the entire collection (see here).
These are very imprecise, leading to false positives (e.g. seen here and here).
They are also at odds with how other languages treat collections, see our issue about this.

We wish to keep the semantics that if a collection is tainted, then all of its elements are considered tainted. Therefore, we now try not to taint a collection when we have precise information about which of its elements are tainted.
For a list, if an element is tainted, we do not know which one, so any read is potentially reading tainted information.
There is not much difference between the list having tainted content and the list itself being tainted.
But for a dictionary, if an entry is tainted and we know which one, only reads of the appropriate key are reading tainted information. All other reads should ideally be considered safe (previously they were not). If we do not know that the other keys are safe, e.g. if the collection came from an untrusted source, we can taint the collection itself, and all reads will be considered dangerous. So for collections with precise content, there is a big difference between having tainted content and the collection itself being tainted.
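
As a rough illustration of the distinction above, here is a toy Python model. It is a sketch only: `DictState`, `read_is_tainted` and the `imprecise_step` flag are hypothetical names invented for this example and do not correspond to anything in the CodeQL libraries.

```python
from dataclasses import dataclass, field


@dataclass
class DictState:
    """Toy abstract state for a dictionary (hypothetical, not CodeQL's model)."""
    tainted_keys: set = field(default_factory=set)  # precise content: known tainted keys
    whole_tainted: bool = False                     # the collection itself is tainted


def read_is_tainted(state: DictState, key: str, imprecise_step: bool) -> bool:
    """Decide whether reading `key` yields tainted data in this toy model."""
    # Imprecise step: any tainted entry taints the entire collection,
    # so every read is considered tainted.
    if imprecise_step and state.tainted_keys:
        return True
    # Precise semantics: a read is tainted only if the collection itself
    # is tainted or the specific key is known to be tainted.
    return state.whole_tainted or key in state.tainted_keys


state = DictState(tainted_keys={"foo"})

# Reading an unrelated key with and without the imprecise step.
old_semantics = read_is_tainted(state, "bar", imprecise_step=True)
new_semantics = read_is_tainted(state, "bar", imprecise_step=False)
```

In this model, reading `"bar"` is reported as tainted only while the imprecise step is enabled, which mirrors the kind of false positive described above. Reading `"foo"`, or reading any key of a dictionary whose `whole_tainted` flag is set, is tainted either way.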

Thus we wish to remove these imprecise taint steps for tuples and dictionaries, where we track content precisely (we keep them for lists and sets, where content is imprecise anyway).
This PR now seems to demonstrate that we can achieve this, although with some caveats:

- we use implicit reads, which makes reasoning about use/use-flow and sinks a bit complicated
- we need a solution to additional flow steps for conversions

The issue with conversions is as follows: We are moving away from tainting collections when only precise content is tainted. But some operations may read any of the collection elements, for instance decoders. A call like

tainted_obj = {"foo": TAINTED_STRING}
encoded = ujson.dumps(tainted_obj)

used to transfer taint from tainted_obj to encoded, via an additional taint step. But now there is no taint to transfer, because tainted_obj itself is not tainted. Instead, the call has to perform a read step. Adding read steps is not trivial, though. The best mechanism we have is flow summaries, but they are awkward to use here, for two reasons:

- We do not actually collect decoder calls, but rather their input and output, whereas flow summaries are formulated in terms of calls and access paths.
- Non-monotonic recursion

We might also be able to get the converter reads as implicit reads, but there is currently no mechanism for doing that for all taint flow configurations at once.
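
To make the conversion problem concrete, the sketch below first shows at runtime that a tainted dictionary entry does end up in the encoded output, and then models the analysis side as a toy function. It uses the standard library `json` module as a stand-in for `ujson`; `converter_output_tainted` and its parameters are hypothetical names for illustration, not CodeQL predicates.

```python
import json

TAINTED_STRING = "user-controlled"  # stand-in for a value from an untrusted source

tainted_obj = {"foo": TAINTED_STRING, "bar": "constant"}
encoded = json.dumps(tainted_obj)

# At runtime the converter reads every entry, so the tainted value
# is part of the encoded output.
contains_tainted = TAINTED_STRING in encoded


def converter_output_tainted(tainted_keys, whole_tainted, has_read_step):
    """Toy model: is the output of a converter call considered tainted?"""
    # A tainted collection always taints the output (plain taint step).
    if whole_tainted:
        return True
    # With only precise content tainted, the output is tainted only if the
    # converter is modelled as reading the collection's content.
    return has_read_step and bool(tainted_keys)


without_read_step = converter_output_tainted({"foo"}, False, has_read_step=False)
with_read_step = converter_output_tainted({"foo"}, False, has_read_step=True)
```

In this model, dropping the imprecise container step without adding a read step loses the flow into `encoded`, even though the tainted string is really there. That is the gap a read step for conversions has to close.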

python/ql/lib/semmle/python/frameworks/Stdlib.qll

+  /**
+   * Flow summaries for string manipulation methods.
+   */


python/ql/test/experimental/meta/ProgressiveTaintTrackingTest.qll

+ * global (inter-procedural) taint-tracking analyses.
+ */
+module TaintTracking {
+  import semmle.python.dataflow.new.internal.tainttracking1.TaintTrackingParameter::Public


Default implicit read steps changed the semantics of our taint tracking tests. This resets that semantics. We include two new annotations to allow testing with implicit reads, as well as a consistency query to prevent spurious implicit read steps.

github-actions bot added the Python label Jul 22, 2024

yoff force-pushed the python/no-imprecise-container-step branch from 34e04ff to 07e3829 on July 24, 2024 15:34

github-advanced-security bot found potential problems Aug 7, 2024

yoff force-pushed the python/no-imprecise-container-step branch from 73a33bf to c891057 on August 7, 2024 11:50

aibaars mentioned this pull request Aug 19, 2024

Tuple Mismatching Results in all variables labelled as sensitive data #17255

Closed

yoff added 14 commits September 9, 2024 13:45

Python: remove the imprecise container taint steps

34eaef7

for collections where one could read out a different element due to precise content

python: add read steps for some specific functions

b82bcda

Python: accept fixed FP

efad89a

Python: allow taint from children to % format

e06386d

- fixes fully/partial SSRF confusion

Python: accept extra edges

eb12b2d

Python: NoSQL, allow implicit reads at sinks

a340713

Python: more sweeping change regarding implicit reads

ee594f9

Python: use qldoc

ef10bcc

Python: update test expectations

cc4f41b

expect empty query predicates

Python: more expectations

c61a4d6

Python: fix test and expectations

20dc6a6

Python: update test with implicit read example

3669307

Python: update line numbers

243f656

yoff force-pushed the python/no-imprecise-container-step branch from efcb918 to 243f656 on September 9, 2024 13:02

yoff added 3 commits September 11, 2024 13:05

Python: more precise emulation of real tainttracking

e158ccf

Consider how to make this maintainable. Could we explicitly disallow implicit reads at specific sinks instead of rebuilding the config without them?

Python: selective extra implicit reads [sketch]

046f890

This is probably what we want, if we can get support for it.

Python: make the decoder read step explicit

00befa0
