Codeql Principle Analysis (Part One)

Overview of CodeQL Principles#

The entire vulnerability scanning process of CodeQL can be divided into two parts:

AST database creation, which can be done using command line tools
Rule writing, using a syntax similar to SQL to write vulnerability query rules

DraggedImage

Divided into compiled languages and non-compiled languages

Non-compiled languages, such as Python, JS

codeql database create --language=javascript --source-root <folder-to-extract> databaseName

Compiled languages, such as C++, Go

codeql database create "xxx" --language=cpp -c "gcc main.c"

The compiled database looks like this:

.
├── baseline-info.json
├── codeql-database.yml
├── db-javascript
│   ├── default
│   ├── semmlecode.javascript.dbscheme
│   └── semmlecode.javascript.dbscheme.stats
├── diagnostic
│   ├── cli-diagnostics-add-20231128T030607.340Z.json
│   ├── extractors
│   └── tracer
├── log
│   └── database-create-20231128.110453.457.log
└── src.zip

The key is the db-javascript folder, which contains the corresponding AST structure database
The log directory is related to logs
src.zip is a backup of the source code

Rule Writing - Taint Analysis#

A Simple Rule#

For example, an arbitrary file read vulnerability, where the fs.readFile function has an arbitrary file read vulnerability

const express = require('express');
const fs = require('fs');
const app = express();

app.get('/getFile', (req, res) => {
  const fileName = req.query.fileName;
  fs.readFile(fileName, 'utf8', (err, data) => {
    if (err) {
      res.status(500).send(err.toString());
    } else {
      res.send(data);
    }
  });
});

app.listen(3000);

We can extract all the code containing the fs.readFile function and write the rule as follows:

/**
 * @name fs-read-file
 * @kind problem
 * @problem.severity warning
 * @tags correctness
 * @id js
 */

import javascript

from CallExpr fsReadFile
where
  fsReadFile.getCalleeName() = "readFile"
select fsReadFile, "This is a call to fs.readFile."

The essence of CodeQL queries is to search through the AST tree, with a syntax similar to SQL

from defines variables, for example, if we want to query readFile, this is a function call expression CallExpr
There are also some constraints, such as the name of the called function in this function call expression, similar to the where clause in SQL
Finally, select the query results

Of course, it is obvious that there are a large number of false positives, as all fs.readFile functions will be considered vulnerable

DraggedImage-1

So we face a problem: how to distinguish between normal code and tainted code sink

Taint Analysis Principles#

DraggedImage-2

As shown in the figure, we define

source: data source, representing the direct introduction of untrusted data or sensitive data into the system
sink: taint aggregation point, representing the direct execution of security-sensitive operations or leaking sensitive data to the outside

The nodes passed through can be considered the smallest tokens in lexical analysis, representing constructs like if, while, for, function calls, etc.

The steps of taint analysis:

As long as we mark the source and sink, CodeQL will determine that if there is a path from source to sink, it will be considered a vulnerability

DraggedImage-3

Limitations of Taint Analysis#

The official principles are not disclosed, but after multiple rounds of testing, we found that CodeQL tracks assignments most accurately; paths are often formal and do not consider language features:

Code 1:

let x = process.argv[0]
let ctrl = 1
let y

if(ctrl > 0){
    y = 1
}else{
    y = x
}
eval(y)

source is defined as process.argv[0]
sink is defined as eval(y)

Even though ctrl is always greater than 0, CodeQL still determines that there is a path from source to sink

Code 2

let x = process.argv[0]
Object.prototype.a = x

let y = {}
eval(y.a)

// or
let x = process.argv[0]

let c = {}
c.a = x

let y = {
    b:c
}
eval(y.b.a)

In fact, there are issues here, but CodeQL cannot recall them

Refining Taint Rules#

CodeQL's judgment is quite limited; simply marking source and sink can lead to a large number of false positives and negatives.

Analyzing these situations, we can roughly categorize the paths from source to sink into the following four types (excluding additional checks needed for code features):

Passing through a series of normal if, while, etc., to reach the sink
Passing through a filter function to reach the sink
Passing through a check function, where if conditions are not met, it cannot reach the sink, but formally it can still flow to the sink
Passing through a normal join function to reach the sink

DraggedImage-4

Data sources reach the taint sink through harmless processing

const express = require('express');
const fs = require('fs');
const app = express();
  
function sanitizePath(path) {
  // Simple cleaning logic, e.g., removing path navigation characters
  return path.replace(/(\.\.\/|\/\.\.)/g, '');
}
  
app.get('/getFile', (req, res) => {
  const fileName = req.query.fileName;
  const safeFileName = sanitizePath(fileName);
  fs.readFile(safeFileName, 'utf8', (err, data) => {
    if (err) {
      res.status(500).send(err.toString());
    } else {
      res.send(data);
    }
  });
});
  
app.listen(3000);

Data source checks fail and return directly

const express = require('express');
const fs = require('fs');
const app = express();
  
const SAFE_DIRECTORY = '/path/to/safe/directory';
  
app.get('/getFile', (req, res) => {
  const fileName = req.query.fileName;
  if (!fileName.startsWith(SAFE_DIRECTORY)) {
    // Not allowed to read
    return res.status(403).send('Access denied');
  }
  fs.readFile(fileName, 'utf8', (err, data) => {
    if (err) {
      res.status(500).send(err.toString());
    } else {
      res.send(data);
    }
  });
});
  
app.listen(3000);

Data source directly reaches the taint, or passes through multiple processes to reach the taint (actually two types)

const express = require('express');
const fs = require('fs');
const path = require('path');
const app = express();

app.get('/getLog', (req, res) => {
  const logFile = req.query.logFile;
  const logPath = path.join(__dirname, 'logs', logFile); // Passed through path.join 
  fs.readFile(logPath, 'utf8', (err, data) => {
    if (err) {
      res.status(500).send(err.toString());
    } else {
      res.send(data);
    }
  });
});

app.listen(3000);

Thus, we need to do additional processing; in addition to defining source and sink, we also need to add barrier, sanitizer, and AdditionalTaintStep.

filter: harmless processing (sanitizer), representing the use of data encryption or removal of harmful operations to ensure that data propagation no longer poses a threat to the software system's information security
barrier: unlike sanitizer, a barrier does not clean or change data but serves as a condition check or decision point to block certain paths of data flow.
AdditionalTaintStep: the source may disconnect during transmission, requiring manual connection

Writing Taint Analysis Rules#

CodeQL provides two ways to analyze:

Static rules, i.e., AST tree queries, presented in terms of AST tree nodes, belonging to static analysis
Dynamic rules, i.e., data flow DataFlow queries, abstracted as DataFlow::Node base class, belonging to dynamic analysis

The AST tree is easy to understand; DataFlow can be a bit abstract, mainly divided into several different nodes:

DataFlow::Node can represent any element in the program, the base class, i.e., tokens in lexical analysis

var x = 10; // 'x' and '10' can both be instances of DataFlow::Node
var y = x + 5; // 'y', 'x + 5', 'x', and '5' are also instances of DataFlow::Node

DataFlow::ValueNode is used to represent values or expressions in the program

var name = "Alice"; // 'name' and "Alice" are both instances of DataFlow::ValueNode
function greet() {
    return "Hello, " + name; // 'return "Hello, " + name;' is an instance of DataFlow::ValueNode
}

DataFlow::SourceNode represents input points in the program, such as user input, file reads, etc.
DataFlow::SinkNode represents points where sinks may exist
DataFlow::PathNode is used for sensitive data analysis variables, with no special purpose

The general process for writing rules is:

Define the source, i.e., what the input data source is, which can be understood as externally input data
Define the sink, i.e., tainted code, such as readFile
Define isBarrier, isSanitizer, isAdditionalTaintStep (not required, just to reduce false positives and negatives)

The general framework is as follows:

import javascript
import DataFlow::PathGraph
import Express

class FileReadFromUserInput extends TaintTracking::Configuration {
  FileReadFromUserInput() { this = "FileReadFromUserInput" }

  override predicate isSource(DataFlow::Node source) {
		// Define Source
  }
  override predicate isSink(DataFlow::Node sink) {
		// Define Sink
  }
}

CodeQL uses a function-like syntax combined with AST syntax tree queries to mark points, such as marking the sink point at eval

   override predicate isSink(DataFlow::Node sink) {
    // Eval is a sink
    exists(CallExpr call |
        call.getCalleeName() = "eval" and
        sink.asExpr() = call.getArgument(0)
      )
   }

Finally, the query:

from FileReadFromUserInput cfg, DataFlow::PathNode source, DataFlow::PathNode sink
where cfg.hasFlowPath(source, sink)
select sink.getNode(), source, sink, "$@ flows to $@ ", source.getNode(), source.toString(), sink.getNode(), sink.toString()

Source#

In the express framework, the source is generally fixed, i.e., req.query.xxxx is the externally input data.

The writing of the source is essentially through AST syntax tree queries. For existing frameworks like express, CodeQL has pre-written rules that can be used directly.

override predicate isSource(DataFlow::Node source) {
    exists(Express::RouteHandler rh, DataFlow::SourceNode sn |
        sn = rh.getARequestSource() and
        source = sn.getAPropertyRead("query").getAPropertyRead()
    )
}

exists is also a predicate, introduced here for simplicity to create temporary variables
Express::RouteHandler is a built-in rule that can find the corresponding route handling code in express

Note that the = should be read from right to left; it essentially means assignment.

After writing, you can immediately select to see the results:

DraggedImage-5

Sink#

Since this is handling arbitrary file read vulnerabilities, the sink is the readFile call we initially wrote. Additionally, it is important to annotate the corresponding sink point.

The first argument of fs.readFile is a potentially controllable point, so through

sink.asExpr() = call.getArgument(0) (the sink here needs to be type-converted, as DataFlow and AST nodes are two models)

  override predicate isSink(DataFlow::Node sink) {
    exists(CallExpr call |
        call.getCalleeName() = "readFile" and
        sink.asExpr() = call.getArgument(0)
      )
  }

AdditionalTaintStep#

However, in this case, there will be vulnerabilities that cannot be found because it has passed through path.join, and the source has been transmitted.

const express = require('express');
const fs = require('fs');
const path = require('path');
const app = express();

app.get('/getLog', (req, res) => {
  const logFile = req.query.logFile;
  const logPath = path.join(__dirname, 'logs', logFile);
  fs.readFile(logPath, 'utf8', (err, data) => {
    if (err) {
      res.status(500).send(err.toString());
    } else {
      res.send(data);
    }
  });
});

app.listen(3000);

So we need to write an additional isAdditionalTaintStep, which means that if we encounter a function like join, we can consider that the source has undergone a transmission, and subsequent nodes can continue to connect.

  override predicate isAdditionalTaintStep(DataFlow::Node pred, DataFlow::Node succ) {
    exists(CallExpr call |
        call.getCalleeName() = "join" and (
            pred.asExpr() = call.getAnArgument() and succ.asExpr() = call
        )
    )
  }

Final Summary of Rules and Result Analysis#

The final rule summary still has some false positives, as it does not handle sanitizers.

/**
 * @name file-read-from-user-input
 * @kind path-problem
 * @problem.severity warning
 * @tags correctness
 * @id js
 */

import javascript
import DataFlow::PathGraph
import Express

class FileReadFromUserInput extends TaintTracking::Configuration {
  FileReadFromUserInput() { this = "FileReadFromUserInput" }

  override predicate isSource(DataFlow::Node source) {
      exists(Express::RouteHandler rh, DataFlow::SourceNode sn |
          sn = rh.getARequestSource() and
          source = sn.getAPropertyRead("query").getAPropertyRead()
      )
  }
  override predicate isSink(DataFlow::Node sink) {
    exists(CallExpr call |
        call.getCalleeName() = "readFile" and
        sink.asExpr() = call.getArgument(0)
      )
  }

  override predicate isAdditionalTaintStep(DataFlow::Node pred, DataFlow::Node succ) {
    exists(CallExpr call |
        call.getCalleeName() = "join" and (
            pred.asExpr() = call.getAnArgument() and succ.asExpr() = call
        )
    )
  }

}

from FileReadFromUserInput cfg, DataFlow::PathNode source, DataFlow::PathNode sink
where cfg.hasFlowPath(source, sink)
select sink.getNode(), source, sink, "$@ flows to $@ ", source.getNode(), source.toString(), sink.getNode(), sink.toString()

The results of the select can also show the data flow direction

DraggedImage-6