Overview of CodeQL Principles#
The entire vulnerability scanning process of CodeQL can be divided into two parts:
- AST database creation, which can be done using command line tools
- Rule writing, using a syntax similar to SQL to write vulnerability query rules
Divided into compiled languages and non-compiled languages
- Non-compiled languages, such as Python, JS
codeql database create --language=javascript --source-root <folder-to-extract> databaseName
- Compiled languages, such as C++, Go
codeql database create "xxx" --language=cpp -c "gcc main.c"
The compiled database looks like this:
.
├── baseline-info.json
├── codeql-database.yml
├── db-javascript
│ ├── default
│ ├── semmlecode.javascript.dbscheme
│ └── semmlecode.javascript.dbscheme.stats
├── diagnostic
│ ├── cli-diagnostics-add-20231128T030607.340Z.json
│ ├── extractors
│ └── tracer
├── log
│ └── database-create-20231128.110453.457.log
└── src.zip
- The key is the
db-javascript
folder, which contains the corresponding AST structure database - The
log
directory is related to logs src.zip
is a backup of the source code
Rule Writing - Taint Analysis#
A Simple Rule#
For example, an arbitrary file read vulnerability, where the fs.readFile
function has an arbitrary file read vulnerability
const express = require('express');
const fs = require('fs');
const app = express();
app.get('/getFile', (req, res) => {
const fileName = req.query.fileName;
fs.readFile(fileName, 'utf8', (err, data) => {
if (err) {
res.status(500).send(err.toString());
} else {
res.send(data);
}
});
});
app.listen(3000);
We can extract all the code containing the fs.readFile
function and write the rule as follows:
/**
* @name fs-read-file
* @kind problem
* @problem.severity warning
* @tags correctness
* @id js
*/
import javascript
from CallExpr fsReadFile
where
fsReadFile.getCalleeName() = "readFile"
select fsReadFile, "This is a call to fs.readFile."
The essence of CodeQL queries is to search through the AST tree, with a syntax similar to SQL
from
defines variables, for example, if we want to queryreadFile
, this is a function call expressionCallExpr
- There are also some constraints, such as the name of the called function in this function call expression, similar to the where clause in SQL
- Finally, select the query results
Of course, it is obvious that there are a large number of false positives, as all fs.readFile
functions will be considered vulnerable
So we face a problem: how to distinguish between normal code and tainted code sink
Taint Analysis Principles#
As shown in the figure, we define
- source: data source, representing the direct introduction of untrusted data or sensitive data into the system
- sink: taint aggregation point, representing the direct execution of security-sensitive operations or leaking sensitive data to the outside
The nodes passed through can be considered the smallest tokens in lexical analysis, representing constructs like if, while, for, function calls, etc.
The steps of taint analysis:
As long as we mark the source and sink, CodeQL will determine that if there is a path from source to sink, it will be considered a vulnerability
Limitations of Taint Analysis#
The official principles are not disclosed, but after multiple rounds of testing, we found that CodeQL tracks assignments most accurately; paths are often formal and do not consider language features:
Code 1:
let x = process.argv[0]
let ctrl = 1
let y
if(ctrl > 0){
y = 1
}else{
y = x
}
eval(y)
- source is defined as
process.argv[0]
- sink is defined as
eval(y)
Even though ctrl
is always greater than 0, CodeQL still determines that there is a path from source to sink
Code 2
let x = process.argv[0]
Object.prototype.a = x
let y = {}
eval(y.a)
// or
let x = process.argv[0]
let c = {}
c.a = x
let y = {
b:c
}
eval(y.b.a)
In fact, there are issues here, but CodeQL cannot recall them
Refining Taint Rules#
CodeQL's judgment is quite limited; simply marking source and sink can lead to a large number of false positives and negatives.
Analyzing these situations, we can roughly categorize the paths from source to sink into the following four types (excluding additional checks needed for code features):
- Passing through a series of normal if, while, etc., to reach the sink
- Passing through a filter function to reach the sink
- Passing through a check function, where if conditions are not met, it cannot reach the sink, but formally it can still flow to the sink
- Passing through a normal join function to reach the sink
- Data sources reach the taint sink through harmless processing
const express = require('express');
const fs = require('fs');
const app = express();
function sanitizePath(path) {
// Simple cleaning logic, e.g., removing path navigation characters
return path.replace(/(\.\.\/|\/\.\.)/g, '');
}
app.get('/getFile', (req, res) => {
const fileName = req.query.fileName;
const safeFileName = sanitizePath(fileName);
fs.readFile(safeFileName, 'utf8', (err, data) => {
if (err) {
res.status(500).send(err.toString());
} else {
res.send(data);
}
});
});
app.listen(3000);
- Data source checks fail and return directly
const express = require('express');
const fs = require('fs');
const app = express();
const SAFE_DIRECTORY = '/path/to/safe/directory';
app.get('/getFile', (req, res) => {
const fileName = req.query.fileName;
if (!fileName.startsWith(SAFE_DIRECTORY)) {
// Not allowed to read
return res.status(403).send('Access denied');
}
fs.readFile(fileName, 'utf8', (err, data) => {
if (err) {
res.status(500).send(err.toString());
} else {
res.send(data);
}
});
});
app.listen(3000);
- Data source directly reaches the taint, or passes through multiple processes to reach the taint (actually two types)
const express = require('express');
const fs = require('fs');
const path = require('path');
const app = express();
app.get('/getLog', (req, res) => {
const logFile = req.query.logFile;
const logPath = path.join(__dirname, 'logs', logFile); // Passed through path.join
fs.readFile(logPath, 'utf8', (err, data) => {
if (err) {
res.status(500).send(err.toString());
} else {
res.send(data);
}
});
});
app.listen(3000);
Thus, we need to do additional processing; in addition to defining source and sink, we also need to add barrier, sanitizer, and AdditionalTaintStep.
- filter: harmless processing (sanitizer), representing the use of data encryption or removal of harmful operations to ensure that data propagation no longer poses a threat to the software system's information security
- barrier: unlike sanitizer, a barrier does not clean or change data but serves as a condition check or decision point to block certain paths of data flow.
- AdditionalTaintStep: the source may disconnect during transmission, requiring manual connection
Writing Taint Analysis Rules#
CodeQL provides two ways to analyze:
- Static rules, i.e., AST tree queries, presented in terms of AST tree nodes, belonging to static analysis
- Dynamic rules, i.e., data flow DataFlow queries, abstracted as
DataFlow::Node
base class, belonging to dynamic analysis
The AST tree is easy to understand; DataFlow
can be a bit abstract, mainly divided into several different nodes:
DataFlow::Node
can represent any element in the program, the base class, i.e., tokens in lexical analysis
var x = 10; // 'x' and '10' can both be instances of DataFlow::Node
var y = x + 5; // 'y', 'x + 5', 'x', and '5' are also instances of DataFlow::Node
DataFlow::ValueNode
is used to represent values or expressions in the program
var name = "Alice"; // 'name' and "Alice" are both instances of DataFlow::ValueNode
function greet() {
return "Hello, " + name; // 'return "Hello, " + name;' is an instance of DataFlow::ValueNode
}
DataFlow::SourceNode
represents input points in the program, such as user input, file reads, etc.DataFlow::SinkNode
represents points where sinks may existDataFlow::PathNode
is used for sensitive data analysis variables, with no special purpose
The general process for writing rules is:
- Define the source, i.e., what the input data source is, which can be understood as externally input data
- Define the sink, i.e., tainted code, such as
readFile
- Define isBarrier, isSanitizer, isAdditionalTaintStep (not required, just to reduce false positives and negatives)
The general framework is as follows:
import javascript
import DataFlow::PathGraph
import Express
class FileReadFromUserInput extends TaintTracking::Configuration {
FileReadFromUserInput() { this = "FileReadFromUserInput" }
override predicate isSource(DataFlow::Node source) {
// Define Source
}
override predicate isSink(DataFlow::Node sink) {
// Define Sink
}
}
CodeQL uses a function-like syntax combined with AST syntax tree queries to mark points, such as marking the sink point at eval
override predicate isSink(DataFlow::Node sink) {
// Eval is a sink
exists(CallExpr call |
call.getCalleeName() = "eval" and
sink.asExpr() = call.getArgument(0)
)
}
Finally, the query:
from FileReadFromUserInput cfg, DataFlow::PathNode source, DataFlow::PathNode sink
where cfg.hasFlowPath(source, sink)
select sink.getNode(), source, sink, "$@ flows to $@ ", source.getNode(), source.toString(), sink.getNode(), sink.toString()
Source#
In the express framework, the source is generally fixed, i.e., req.query.xxxx
is the externally input data.
The writing of the source is essentially through AST syntax tree queries. For existing frameworks like express, CodeQL has pre-written rules that can be used directly.
override predicate isSource(DataFlow::Node source) {
exists(Express::RouteHandler rh, DataFlow::SourceNode sn |
sn = rh.getARequestSource() and
source = sn.getAPropertyRead("query").getAPropertyRead()
)
}
exists
is also a predicate, introduced here for simplicity to create temporary variablesExpress::RouteHandler
is a built-in rule that can find the corresponding route handling code in express
Note that the =
should be read from right to left; it essentially means assignment.
After writing, you can immediately select to see the results:
Sink#
Since this is handling arbitrary file read vulnerabilities, the sink is the readFile
call we initially wrote. Additionally, it is important to annotate the corresponding sink point.
The first argument of fs.readFile
is a potentially controllable point, so through
sink.asExpr() = call.getArgument(0)
(the sink here needs to be type-converted, as DataFlow and AST nodes are two models)
override predicate isSink(DataFlow::Node sink) {
exists(CallExpr call |
call.getCalleeName() = "readFile" and
sink.asExpr() = call.getArgument(0)
)
}
AdditionalTaintStep#
However, in this case, there will be vulnerabilities that cannot be found because it has passed through path.join
, and the source has been transmitted.
const express = require('express');
const fs = require('fs');
const path = require('path');
const app = express();
app.get('/getLog', (req, res) => {
const logFile = req.query.logFile;
const logPath = path.join(__dirname, 'logs', logFile);
fs.readFile(logPath, 'utf8', (err, data) => {
if (err) {
res.status(500).send(err.toString());
} else {
res.send(data);
}
});
});
app.listen(3000);
So we need to write an additional isAdditionalTaintStep
, which means that if we encounter a function like join
, we can consider that the source has undergone a transmission, and subsequent nodes can continue to connect.
override predicate isAdditionalTaintStep(DataFlow::Node pred, DataFlow::Node succ) {
exists(CallExpr call |
call.getCalleeName() = "join" and (
pred.asExpr() = call.getAnArgument() and succ.asExpr() = call
)
)
}
Final Summary of Rules and Result Analysis#
The final rule summary still has some false positives, as it does not handle sanitizers.
/**
* @name file-read-from-user-input
* @kind path-problem
* @problem.severity warning
* @tags correctness
* @id js
*/
import javascript
import DataFlow::PathGraph
import Express
class FileReadFromUserInput extends TaintTracking::Configuration {
FileReadFromUserInput() { this = "FileReadFromUserInput" }
override predicate isSource(DataFlow::Node source) {
exists(Express::RouteHandler rh, DataFlow::SourceNode sn |
sn = rh.getARequestSource() and
source = sn.getAPropertyRead("query").getAPropertyRead()
)
}
override predicate isSink(DataFlow::Node sink) {
exists(CallExpr call |
call.getCalleeName() = "readFile" and
sink.asExpr() = call.getArgument(0)
)
}
override predicate isAdditionalTaintStep(DataFlow::Node pred, DataFlow::Node succ) {
exists(CallExpr call |
call.getCalleeName() = "join" and (
pred.asExpr() = call.getAnArgument() and succ.asExpr() = call
)
)
}
}
from FileReadFromUserInput cfg, DataFlow::PathNode source, DataFlow::PathNode sink
where cfg.hasFlowPath(source, sink)
select sink.getNode(), source, sink, "$@ flows to $@ ", source.getNode(), source.toString(), sink.getNode(), sink.toString()
The results of the select can also show the data flow direction