如何在數小時內開發程式碼分析器-java教程-PHP中文網

靜態分析是一個強大的工具，可以幫助開發人員控製程式碼品質。讓我們試著使用 Java 為 Lua 開發一個簡單的分析器，看看靜態分析器的底層是什麼。

How to develop code analyzer in hours

小前言

我們將用 Java 寫 Lua 的靜態分析器。

為什麼是Lua？它的語法很簡單，沒有必要陷入細節。它也是一種很好的語言，~~尤其是與 JavaScript 相比~~。為什麼選擇Java？它為我們提供了全方位的技術棧，而且Java也是一種用戶友好的開發語言。

不:)但是，本文介紹了我們如何在內部黑客馬拉松中開發真正的分析儀試點。所以，這個標題不僅僅是一個標題誘餌——我們確實有 48 小時。我們的團隊有三名開發人員，所以如果您想單獨嘗試這個，請準備好花費更多時間。

由於這只是一個試點，我們在文章中將我們的分析器稱為「Mun」。

什麼是分析儀？

在開始之前，最好先了解分析器是什麼並規劃出工作範圍。總而言之，很明顯：抓住代碼，如果這裡有問題就抱怨。我們到底需要什麼？我們對靜態分析的以下方面感興趣：

詞法分析器和解析器。我們獲取原始程式碼並將其轉換為易於使用的樹（AST）。
AST（或抽象語法樹）是將程式的資料結構表示為樹的一種方式。它包含有關程式語法的資料。
語意資料。僅語法資料不足以進行分析。因此，我們需要一種額外的機制來聚合樹中的語義（含義）資料。例如，它可能是變數範圍。
資料流分析。如果我們想做一些深入的分析，我們可以嘗試預測程式節點中的變數值。例如，它有助於捕獲與零除法相關的錯誤。

聽起來很乏味，但這只是分析器的核心。順便說一句，編譯器在前端也有同樣的東西。然而，最棘手的部分（程式碼產生）就潛伏在後端。

核心的計畫是什麼？範圍如下：

編寫診斷規則來偵測程式碼中的錯誤；
收集偵測到的錯誤並發出報告；
建立我們自己的外掛程式來查看警告（可選）。

確實，48小時太多了：）有些東西必須放棄，有些東西要精簡，有些東西要重用。當我們閱讀本文時，我們將考慮此類情況。

為了更好地理解，讓我們將其圖解一下：

How to develop code analyzer in hours

這足以開始了。您可以在我們的網站上了解有關靜態分析的更多資訊。

核

詞法分析器和解析器

理論

所以，詞法分析器和解析器是分析器的基礎，沒有它們我們就無法走得更遠。如果我們想要正規表示式以外的任何東西。

使用詞法分析器非常簡單：將原始程式碼作為文字處理很麻煩。因此，我們需要將其轉換為中間表示，即將其拆分為標記。也就是說，詞法分析器必須不聰明。這是必須具備的，因為所有最困難、最混亂的東西都會進入解析器。

然後讓我們轉向解析器。它接收傳入的 token，計算出它們，然後建構 AST。這是一個簡單的背景知識。

語言有語法，並且有不同類型的解析器可以使用它們。如果我們自己編寫，最好為上下文無關語法編寫簡單的 LL(1) 解析器。正如我們之前所說，Lua 有直觀的語法，所以這就足夠了。

LL 表示從左到右讀取輸入字串，並為其建構從左到右的輸出。一般來說，僅僅查看當前標記 (LL(0)) 是不夠的，因此我們可能需要查看前面的 k 個標記。這樣的解析器稱為 LL(k)。

但是，我們不需要它，因為我們不會寫任何東西。為什麼？我們只有 48 小時，沒有時間創建解析器 - 特別是如果您從未開發過它。

選擇的方法

我們必須用什麼替代方案來寫我們的詞法分析器和解析器？發電機！這是一整類實用程序，只需要我們提供一個專門描述的語法檔案作為輸入來產生解析器。

我們選了 ANTLR v4。該工具也是用 Java 編寫的，這使得它非常易於使用。經過多年的發展，它已經開始表現得很好。

問題也潛伏在這裡——我們需要知道如何為解析器編寫語法檔案。幸運的是，在 GitHub 上找到現成的選項並不難，所以我們就從這裡獲取。

使用 ANTLR 設定專案後，讓我們繼續建立抽象語法樹。

抽象語法樹

說到樹：ANTLR 經過精心設計，可以視覺化程式碼分析。例如，以下是階乘計算：

function fact (n)
  if n == 0 then
    return 1
  else
    return n * fact(n-1)
  end
end

print("enter a number:")
a = io.read("*number")
print(fact(a))

登入後複製

我們可以得到以下樹：

How to develop code analyzer in hours

我們認為它可能更接近分類的解析樹。我們可以停在這裡並開始使用它。

但我們不會這樣做並將其轉換為 AST。為什麼？作為 Java 團隊，使用與分析器中的樹類似的樹會更容易。您可以在此處閱讀有關 Java 開發的更多資訊。 Spoon 的範例類別層次結構如下圖所示：

How to develop code analyzer in hours

好了，延遲手動工作已經足夠了，是時候編寫程式碼了。我們沒有完全展示整個程式碼——它沒有任何意義，因為它太大而且難看。為了讓您稍微了解我們的思維方式，我將在劇透下留下一些程式碼片段，供有興趣的人使用。

我們從樹頂開始處理：

public void enterStart_(LuaParser.Start_Context ctx) {
    _file = new CtFileImpl();
    _file.setBlocks(new ArrayList<>());
    for (var chunk : ctx.children) {
        if (chunk instanceof LuaParser.ChunkContext) {
            CtBlock block = getChunk((LuaParser.ChunkContext)chunk);
            if (block == null)
                continue;

            block.setParent(_file);
            _file.getBlocks().add(block);
        }
    }
}

private CtBlock getChunk(LuaParser.ChunkContext ctx) {
    for (var block : ctx.children) {
        if (block instanceof LuaParser.BlockContext) {
            return getBlock((LuaParser.BlockContext)block);
        }
    }
    return  null;
}

private CtBlock getBlock(LuaParser.BlockContext ctx) {
    CtBlock block = new CtBlockImpl();
    block.setLine(ctx.start.getLine());
    block.setColumn(ctx.start.getCharPositionInLine());
    block.setStatements(new ArrayList<>());

    for (var statement : ctx.children) {
        if (statement instanceof LuaParser.StatContext) {
            var statements = getStatement((LuaParser.StatContext)statement);
            for (var ctStatement : statements) {
                ctStatement.setParent(block);
                block.getStatements().add(ctStatement);
            }
        }
    }
    return block;
}

登入後複製

我們只需從上到下遍歷這棵樹，沿途建造我們的樹。遲早，我們會到達終端節點。以下是函數參數：

private List<CtParameter> parseParameters(
    LuaParser.NamelistContext ctx,
    CtElement parent
) {
    var parameters = new ArrayList<CtParameter>();
    for (var child : ctx.children) {
        if (Objects.equals(child.toString(), ","))
            continue;

        var parameter = new CtParameterImpl();
        parameter.setParameterName(child.toString());
        parameter.setParent(parent);
        parameters.add(parameter);
    }
    return parameters;
}

登入後複製

這似乎也不是很有挑戰性——我們只是將一個物體變成另一個物體。我們也將代碼列表包含在這裡。我們希望它的工作方式是清楚的。

坦白說，從長遠來看，這種方法不會帶來太大的利益。我們不是將文字轉換為樹，而是將一棵樹轉換為另一棵樹。這兩項任務都相當乏味。我們有什麼選擇？

我們可以從頭開始寫詞法分析器和解析器。這很好，但當我們受到截止日期和技能的限制時就不行了。
我們可以設定 ANTLR 以立即輸出所需的 AST。聽起來好得令人難以置信，但我們仍然需要研究 ANTLR，這也是極大的時間浪費。
解決方案快速上市。我們必須利用現有的資源，將生成的樹轉換為目標樹。雖然不好，但是還可以忍受。
我們根本不會轉換。如果團隊中沒有 Java 分析器開發人員，我們早就這麼做了。

如果我們沒有為 AST 提供程式碼分析範例，那麼本節顯然是不完整的。您可以在劇透下檢查樹的漂亮列印，以進行相同的階乘計算。

CtGlobal:
  CtFile:
    CtFunction: func
      Parameters:
        CtParameter: n
      CtBlock:
        CtIf:
          Condition:
            CtBinaryOperator: Equals
              Left:
                CtVariableRead: n
              Right:
                CtLiteral: 0
          Then block
            CtBlock:
              CtReturn:
                CtLiteral: 1    
          Else block
            CtBlock:
              CtReturn:
                CtBinaryOperator: Multiplication
                  Left:
                    CtVariableRead: n
                  Right:
                    CtInvocation: fact
                      Arguments:
                        CtBinaryOperator: Minus
                          Left:
                            CtVariableRead: n
                          Right:
                            CtLiteral: 1
    CtInvocation: print
      Arguments:
        CtLiteral: "enter a number:"
    CtAssignment:
      Left:
        CtVariableWrite: a
      Right:
        CtInvocation:
          CtFieldRead: 
            Target: io
            Field: read
          Arguments:
            CtParameter: "*number"
    CtInvocation: print
      Arguments:
        CtInvocation: fact
          Arguments:
            CtVariableRead: a

登入後複製

讓我們對術語進行快速評論以結束本文。雖然最好將我們的實體稱為樹翻譯器，但現在我們將其稱為解析器。

遊客

接下來，我們將開發執行分析並用於建立診斷規則的功能。這是一棵樹的遍歷。總的來說，很容易掌握如何實現樹迭代。然而，我們需要對樹節點做一些有用的事情。訪客模式已登場。

你還記得經典 GoF 書中那個引人入勝的模式，它有一個很酷的實現，但使用場景卻相當模糊？因此，我們有一個關於如何在實際情況中使用它的基準案例。為了保持文章簡潔，我不會在書中展示它是如何實現的，但我將展示我們如何在分析器中實現它。

讓我們從簡單的樹遍歷開始。我們定義 CtScanner 類，並為單一項目及其集合添加兩個 scan 方法。

public <T extends CtElement> void scan(T element) {
    if (element != null) {
        element.accept(this);
    }
}

public <T extends CtElement> void scan(List<T> elements) {
    if (elements != null) {
        for (var element : elements) {
            scan(element);
        }
    }
}

登入後複製

你看到這個來自CtElement的接受了嗎？在我們的例子中，任何繼承 CtVisitable 介面的類別都會實作 void accept(CtAbstractVisitor Visitor) 方法。稍後我們將討論 CtAbstractVisitor。現在，只要知道 CtScanner 是從它繼承的就夠了。

這就是 accept 在 CtAssignment:
中的樣子

@Override
public void accept(CtAbstractVisitor visitor){
    visitor.visitCtAssignment(this);
}

登入後複製

是的，這是小菜一碟。節點在訪客中呼叫它們的方法。在我們的CtScanner中應該有樹節點每個類別的方法：

@Override
public void visitCtIf(CtIf ctIf) {
    scan(ctIf.getCondition());
    scan((CtStatement) ctIf.getThenStatement());
    scan((CtStatement) ctIf.getElseStatement());
}

@Override
public <T> void visitCtLiteral(CtLiteral<T> literal) {
}
@Override
public void visitCtStatement(CtStatement statement) {
}

// ....

登入後複製

現在讓我們回到 CtAbstractVisitor，一個從 CtScanner 中提取的介面。它包含存取樹節點的方法，但只有這些方法，沒有 scan 方法。在訪客方法的實作中，我們要麼為將來的重載留下一個佔位符（如果它是終端節點），要麼我們透過沿著樹節點執行遞歸下降來繼續展開樹節點。這就是我們繼續需要知道的全部。

Semantic parsing

Introduction

It'd seem that's all with the core. For example, now our analyzer can catch simple errors like the variable self-assignment. We call it the signature analysis. To provide more advanced analysis, we'd like to obtain more data about what's going on in the code. The parser has done its job, it's time to create new entities.

So far, we've completely followed the way of compilers regarding of the analyzer structure, but now our path diverges. If Lua had a compiler, it'd start analysis to ensure that the code is syntactically correct. Although our tools will continue to overlap in features.

For our needs, let's create SemanticResolver for two purposes:

define the variable scope;
evaluate the variable type based on the duck typing.

However, we'll call it from the parser along with its traversal. No decorator this time.

To keep the app structure seamless, we'll contain all the semantic data in the same AST nodes. First, let's define the necessary properties in the CtVariableAccess interface:

CtBlock getScope();
void setScope(CtBlock block);

TypeKind getTypeKind();
void setTypeKind(TypeKind type);

登入後複製

Variable scope

Let's start with scope; this will be our key tool for defining the variable along with its name. First, we define the variable entity inside SemanticResolver. To keep it short, we'll show you only the interface, but the gist should be clear:

public static class Variable {
    public Variable(String identifier);
    public String getIdentifier();
    public CtBlock getScope();
    public void setScope(CtBlock block);
    public void setType(TypeKind type);
    public TypeKind getType();
    // Methods use only identifier
    @Override
    public boolean equals(Object o);
    @Override
    public int hashCode();
}

登入後複製

Let's also further define the stack for variable scopes:

private final Stack<Pair<CtBlock, HashSet<Variable>>> stack = new Stack<>();
private final CtGlobal global;
public SemanticResolver(CtGlobal global) {
    pushStack(global);
    this.global = stack.peek();
}

public void pushStack(CtBlock block) {
    stack.push(Pair.of(block, new HashSet<>()));
}
public void popStack() {
    stack.pop();
}

登入後複製

The stack entry consists of a tuple of scope and variables set registered there. The stack operation is mundane. Here's how it works in the parser:

private CtBlock getBlock(LuaParser.BlockContext ctx) {
    CtBlock block = new CtBlockImpl();
    resolver.pushStack(block);

    // ....

    resolver.popStack();
    return block;
}

登入後複製

We have to register the variables somehow. If a variable is local, it's simple—let's take the current scope and pass the variable there:

public CtBlock registerLocal(Variable variable) {
    var scope = stack.pop();
    variable.setScope(scope.getLeft());
    scope.getRight().add(variable);
    stack.push(scope);
    return scope.getLeft();
}

登入後複製

If the local keyword isn't used, the variable will be either global or be declared somewhere above. So, we first go through the stack and double-check if it exists:

public CtBlock registerUndefined(Variable variable) {
    var pair = lookupPair(variable);
    pair.getRight().add(variable);
    return pair.getLeft();
}

public Pair<CtBlock, HashSet<Variable>> lookupPair(Variable variable) {
    var buf = new Stack<Pair<CtBlock, HashSet<Variable>>>();
    Pair<CtBlock, HashSet<Variable>> res = null;
    while (!stack.isEmpty()) {
        var scope = stack.pop();
        buf.push(scope);
        if (scope.getRight().contains(variable)) {
            res = scope;
            break;
        }
    }

    while (!buf.isEmpty()) {
        stack.push(buf.pop());
    }

    if (res == null) {
        return global;
    }

    return res;
}

登入後複製

We can set the scope to variables when we have the entry:

private CtVariableWrite getVariableWriteInternal(
        ParseTree ctx,
        boolean isLocal
) {
    var node = new CtVariableWriteImpl();
    node.setVariableName(ctx.getChild(0).toString());
    CtBlock scope;
    if (isLocal) {
        scope = resolver.registerLocal(
             new SemanticResolver.Variable(node.getVariableName()));
    } else {
        scope = resolver.registerUndefined(
            new SemanticResolver.Variable(node.getVariableName()));
    }
    node.setScope(scope);
    return node;
}

登入後複製

And defining it for readings:

private CtExpression getExpression(LuaParser.ExpContext ctx) {
    // ....
    if (child instanceof LuaParser.PrefixexpContext) {
        // ....
        var scope = resolver.lookupScope(
            new SemanticResolver.Variable(variableRead.getVariableName())
        );
        variableRead.setScope(scope);

        return variableRead;
    }
    // ....
}

登入後複製

We won't show the lookupScope code. It's a single-line wrapper over lookupPair, which we can see above. Here, we can end with the variable scope. We'll check the mechanism in the diagnostic rule on a separate section. Now we continue working on the semantic parsing. Now, let's move to the variable type determination.

Duck typing

How to determine the variable type? Indeed, we obtain them from the literals. Let's further define the type and the enumeration for them:

public interface CtLiteral<T> extends CtExpression, CtVisitable {
  // ....
  void setTypeKind(TypeKind kind);
  TypeKind getTypeKind();
}

登入後複製

public enum TypeKind {
    Undefined,
    Number,
    String,
    Boolean,
    Nil
}

登入後複製

Thus, the data type can be numeric, string, logical, and nil. However, it'll be undefined by default. The split of undefined and nil may seem far-fetched, but it's okay for the pilot.

We store the literal type only in the tree node, doing it from the parser:

private <T> CtLiteralImpl<T> createLiteral(
    // ....
    TypeKind type,
) {
    // ....

    literal.setTypeKind(type);
    return literal;
}

登入後複製

However, the variable type will be both in the tree and in SemanticResolver. So, we can request it during further traversal and the AST building:

private ArrayList<CtAssignment> parseAssignments(LuaParser.StatContext ctx) {
    // ....
    for (int i = 0; i < variables.size(); i++) {
        var assignment = new CtAssignmentImpl();
        var variable = variables.get(i);

        // ....


        variable.setTypeKind(resolver.lookupType(variable.getVariableName()));
        resolver.setType(
            variable.getVariableName(),
            variable.getScope(),
            SemanticResolver.evaluateExpressionType(expression)
        );
    }
    return assignments;
}

登入後複製

There is no error in the operator precedence. Let the stored variable have the type from its past assignment. It'll facilitate our work in the future. As for the methods used here, there's nothing incredible about the lookupType implementation—it's basically the same as lookupPair. There's nothing complex about setType:

public void setType(String variable, CtBlock scope, TypeKind type) {
    var opt = stack.stream()
             .filter(x -> Objects.equals(x.getLeft(), scope))
             .findFirst();
    if (opt.isPresent()) {
        var pair = opt.get();
        var newVar = new Variable(variable);
        var meta = pair.getRight()
                       .stream()
                       .filter(x -> x.equals(newVar))
                       .findFirst();
        meta.ifPresent(value -> value.setType(type));
    }
}

登入後複製

However, evaluateExpressionType is trickier. Computing variable types in dynamic languages can be a bit of a hassle. Just look at the jokes about JavaScript and string concatenation. However, firstly, Lua has a separate operator '..', and secondly, we're trying to keep the process simple. So, we'll only determine whether all the operands are the same type. We'll use the familiar CtScanner.

public static TypeKind evaluateExpressionType(CtExpression expression) {
    Mutable<TypeKind> type = new MutableObject<>(null);
    var typeEvaluator = new CtScanner() {
        private boolean stop = false;
        @Override
        public void scan(CtElement el) {
            if (stop) { return; }

            if (el instanceof CtVariableRead || el instanceof CtLiteral<?>) {
                var newType = el instanceof CtVariableRead
                              ? ((CtVariableRead) el).getTypeKind()
                              : ((CtLiteral<?>) el).getTypeKind();

                if (newType.equals(TypeKind.Undefined)) {
                    type.setValue(TypeKind.Undefined);
                    stop = true;
                    return;
                } else if (type.getValue() == null) {
                    type.setValue(newType);
                } else if (!type.getValue().equals(newType)) {
                    type.setValue(TypeKind.Undefined);
                    stop = true;
                    return;
                }
            }
            super.scan(el);
        }
    };
    typeEvaluator.scan(expression);
    return type.getValue();
}

登入後複製

In parseAssignments, we've set the assignment type to the (CtVariableWrite) variable but forgot about reading (CtVariableRead). Let's fix it:

private CtExpression getExpression(LuaParser.ExpContext ctx) {
    // ....
    if (child instanceof LuaParser.PrefixexpContext) {
        // ....
        variableRead.setTypeKind(
            resolver.lookupType(variableRead.getVariableName())
        );
        var scope = resolver.lookupScope(
            new SemanticResolver.Variable(variableRead.getVariableName()));
        variableRead.setScope(scope);

        return variableRead;
    }
    // ....
}

登入後複製

We've completed the semantic analysis and almost ready to start searching for bugs.

Data-flow analysis

Inside structure

First, we make two quick stops. While the topic of the data-flow analysis deserves a series of articles, it'd be wrong to skip over it. Here, we won't dive deep into theory but just try to memorize the values set by literals.

First, let's fall into the sin of self-copying and define the variable entity for DataFlow again but in a simpler way. Again, we'll only show you the interface:

private static class Variable {
    private Variable(String identifier, CtBlock scope);
    // Methods use identifier and scope
    @Override
    public boolean equals(Object o);
    @Override
    public int hashCode();
}

登入後複製

Here's the rest of the class content:

public class DataFlow {
    private static class Variable {
        // ....
    }
    Map<Variable, Object> variableCache = new HashMap<>();

    public void scanDataFlow(CtElement element) {
        if (element instanceof CtAssignment) {
            CtAssignment variableWrite = (CtAssignment) element;
            if (variableWrite.getAssignment() instanceof CtLiteral<?>) {
                var assigned = variableWrite.getAssigned();
                var variable = new Variable(
                    assigned.getVariableName(),
                    assigned.getScope()
                );
                variableCache.put(
                    variable, 
                    getValue(variableWrite.getAssignment())
                );
            }
        }
    }

    public Object getValue(CtExpression expression) {
        if (expression instanceof CtVariableRead) {
            CtVariableRead variableRead = (CtVariableRead) expression;
            var variable = new Variable(
                variableRead.getVariableName(),
                variableRead.getScope()
            );
            return variableCache.getOrDefault(variable, null);
        } else if (expression instanceof CtLiteral<?>) {
            return ((CtLiteral<?>) expression).getValue();
        }
        return null;
    }
}

登入後複製

It's quite simple: in scanDataFlow, we associate the value with the variable, and in getValue, we extract that value for the set node. Everything is simple because we don't factor in branching, loops, or even expressions. Why don't we factor them in? Branching is the very topic that deserves its own series of articles. What about expressions? Well, we didn't make it in two days. Given what we've achieved in the last two days, we think this task is feasible. We just postpone it as a homework.

That's all. It's clear that such a solution is far from a real product, but we have laid some foundation. Then we can either try to enhance the code—and we'll introduce data-flow analysis via AST. Or we can redo everything properly and build the control-flow graph.

We've done the class implementation, but we haven't discussed how to use it. We've written the class, but what shall we do with it? Let's talk about it in the following section. Now we just say that DataFlow works right before calling the diagnostic rules. Those are called when traversing the finished AST. Thus, the rules will have access to the current variable values. This is called Environment; you can see it in your debugger.

Walker

Welcome to the last section regarding the core. We've already had the AST that is full of semantic data, as well as the data-flow analysis that is just waiting to be run. It's a good time to put it all together and set the stage for our diagnostic rules.

How is the analysis performed? It's simple: we get on the topmost tree node, and then we start recursive traversal. That is, we need something that will traverse the tree. We have CtScanner for that. Based on it, we define MunInvoker:

public class MunInvoker extends CtScanner {
    private final List<MunRule> rules = new ArrayList<>();
    private final Analyzer analyzer;

    public MunInvoker(Analyzer analyzer) {
        this.analyzer = analyzer;
        rules.add(new M6005(analyzer));
        rules.add(new M6020(analyzer));
        rules.add(new M7000(analyzer));
        rules.add(new M7001(analyzer));
    }

    @Override
    public <T extends CtElement> void scan(T element) {
        if (element != null) {
            analyzer.getDataFlow().scanDataFlow(element);
            rules.forEach(element::accept);
            super.scan(element);
        }
    }
}

登入後複製

You can notice a few unknown things in the code:

the Analyzer class. It encapsulates the whole analysis process and contains shared resources that need to be accessed within the rules. In our case, this is the DataFlow instance. We'll get back to Analyzer later;
four confusing classes that are added to rules. We'll talk about them in the next section, so don't panic. A little spoiler for you :)

Otherwise, the class operation shouldn't raise any questions. Each time we enter any tree node, the analyzer rule is called, and the variable values are evaluated right before it. Next, the traversal continues in line with to the CtScanner algorithm.

Analysis

Preparing for writing diagnostic rules

Rule class

So, we have the analyzer core prototype—it's a good time to start analyzing something.

The base for our rules is ready—it's the CtAbstractVisitor class. The analysis goes as follows: the rule overloads a few visitors and analyzes the data contained in the AST nodes. Let's extend CtAbstractVisitor with the abstract MunRule class, which we use to create rules. In this class, we also define the addRule method that generates warnings.

Speaking of warnings: what data do they need? First is a warning message to demonstrate users what they may have been mistaken about. Second, the user needs to know where and what the analyzer warns. So, let's add data about the file where the analyzer has detected the troublesome code block and the location of this code fragment.

Here's what the MunRule class looks like:

public abstract class MunRule extends CtAbstractVisitor {
    private Analyzer analyzer;
    public void MunRule(Analyzer analyzer) {
        this.analyzer = analyzer;
    }
    protected Analyzer getAnalyzer() {
        return analyzer;
    }

    protected void addRule(String message, CtElement element) {
        var warning = new Warning();
        warning.message = message;

        WarningPosition pos = new WarningPosition(
            Analyzer.getFile(), 
            element.getLine(), 
            element.getColumn() + 1
        );
        warning.positions.add(pos);
        analyzer.addWarning(warning);
    }

    public DataFlow getDataFlow() {
        return analyzer.getDataFlow();
    }
}

登入後複製

The WarningPosition and Warning classes are just data stores, so we won't list them. We'll talk about addWarning now.

Merging it together

The last thing to prepare is how we're going to view the diagnostic rules. To do this, we combine all our features together, using the already mentioned Analyzer class. Actually, here it is:

public class Analyzer {
    private DataFlow dataFlow = new DataFlow();

    public DataFlow getDataFlow() {
        return dataFlow;
    }

    public CtElement getAst(String pathToFile) throws IOException {
        InputStream inputStream = new FileInputStream(pathToFile);
        Lexer lexer = new LuaLexer(CharStreams.fromStream(inputStream));

        ParseTreeWalker walker = new ParseTreeWalker();
        var listener = new LuaAstParser();
        walker.walk(listener, new LuaParser(
            new CommonTokenStream(lexer)
        ).start_());
        return listener.getFile();
    }

    protected void addWarning(Warning warning) {
        Main.logger.info(
            "WARNING: " + warning.code + " "
            + warning.message + " ("
            + warning.positions.get(0).line + ", "
            + warning.positions.get(0).column + ")");
      }      

    public MunInvoker getMunInvoker() {
        return new MunInvoker(this);
    }

    public void analyze(String pathToFile) {
        try {
            var top = getAst(pathToFile);
            var invoker = getMunInvoker();
            invoker.scan(top);
        }
        catch (IOException ex) {
            Main.logger.error("IO error: " + ex.getMessage());
        }
    }
}

登入後複製

To explain how it works, we'll give you an example of the whole analysis process:

In the getAst method, we build our AST using the scheme lexer —> parser —> tree translator;
Then we call MunIvoker, which traverses the tree and calls our diagnostic rules along with the data-flow analysis;
If necessary, the rules call the Analyzer class to get the DataFlow instance;
They call addWarning when the analyzer spots suspicious code fragment. That one just outputs the message to the log.

We've got the prep work—it's time to start writing diagnostic rules.

Writing diagnostic rules

Assigning a variable to itself

We've decided to start writing the rules with a simple one: PVS-Studio has the Java diagnostic rule, V6005, where the analyzer check if a variable is assigned to itself. It can simply be copied and slightly adapted to our tree. Since our analyzer is called Mun, we start the numbers of our diagnostic rules with M. Let's create the M6005 class, extending MunRule, and override the visitCtAssignment method in it. The following check will be located in the method:

public class M6005 extends MunRule {
    private void addRule(CtVariableAccess variable) {
        addRule("The variable is assigned to itself.", variable);
    }
    @Override
    public void visitCtAssignment(CtAssignment assignment) {
        if (RulesUtils.equals(assignment.getAssigned(), 
            assignment.getAssignment())) {
            addRule(assignment.getAssigned());
        }
    }
}

登入後複製

The used RulesUtils.equals method is a wrapper and overload for another equals method that checks the name and scope:

public static boolean equals(CtVariableAccess left, CtVariableAccess right) {
    return left.getVariableName().equals(right.getVariableName())
           && left.getScope().equals(right.getScope());
}

登入後複製

We need to check the scope, because the next code fragment isn't an assignment of the variable to itself:

local a = 5;
begin
    local a = a;
end

登入後複製

Now, we can test the diagnostic rule on some simple code and see if it works. The following example will issue the warning on the marked line: "M6005 The variable is assigned to itself":

local a = 5;
local b = 3;

if (b > a) then
    a = a;      <=
end

登入後複製

Zero division

Well, we've warmed up, so let's move on. The analyzer has already had the primitive data-flow analysis (DataFlow) that we can and should use. Again, let's look at one of our existing diagnostic rules, V6020, where the analyzer checks for the zero division. We try to adapt it for our analyzer. The divisor can be either the variable as zero or a literal as zero. We need to access the cache of variables to check their value.

Here's the simple implementation of such a diagnostic rule:

public class M6020 extends MunRule {
  private void addWarning(CtElement expression, String opText) {
    addRule(String.format(
            "%s by zero. Denominator '%s' == 0.",
            opText, expression instanceof CtLiteral
                ? ((CtLiteral) expression).getValue()
                : ((CtVariableRead) expression).getVariableName()
        ),
        expression
    );
  }

  @Override
  public void visitCtBinaryOperator(CtBinaryOperator operator) {
    BinaryOperatorKind opKind = operator.getKind();
    if (opKind != BinaryOperatorKind.DIV && opKind != BinaryOperatorKind.MOD) {
      return;
    }

    apply(operator.getRightHandOperand(), opKind == BinaryOperatorKind.MOD);
  }


  private void apply(CtExpression expr, boolean isMod) {
    Object variable = getDataFlow().getValue(expr);
    if (variable instanceof Integer) {
      if ((Integer) variable == 0) {
        String opText = isMod ? "Mod" : "Divide";
        addWarning(expr, opText);
      }
    }
  }
}

登入後複製

We can see that the diagnostic rule works on simple examples and issues the following warning: "M6020 Divide by zero. Denominator 'b' == 0" on these lines:

local a = 5;
local b = 0;
local c = a / b;   <=
local d = a / 0;   <=

登入後複製

If you have made the expression evaluator as your homework, you may try the diagnostic rule on this code:

local b = 7;
local b = b - 7;
local c = a / b;

登入後複製

Overwritten types

Since we're writing the Lua analyzer, we need to write diagnostic rules for this language. Let's start simple.

Lua is a dynamic scripting language. Let's use this feature and write the diagnostic rule that allows us to catch overwritten types.

We'll also need to pick a new number for the diagnostic rule. Earlier we just copied them from the Java analyzer. Now, it seems like it's time to start a new thousand—the seventh. Who knows what diagnostic rule it'll be in PVS-Studio, but at the time of writing this article, it will be Lua.

Knowing about types, we facilitate our case: we need to check that the left and right assignments are of different types. Note that we ignore Undefined for the left part and Nil for the right part. The code looks like this:

public class M7000 extends MunRule {
    @Override
    public void visitCtAssignment(CtAssignment assignment) {
        var assigned = assignment.getAssigned();
        var exprType = SemanticResolver.evaluateExpressionType(
            assignment.getAssignment());
        if (assigned.getTypeKind().equals(TypeKind.Undefined)
            || exprType.equals(TypeKind.Nil)
        ) {
            return;
        }

        if (!assigned.getTypeKind().equals(exprType)) {
            addRule(
                String.format(
                    "Type of the variable %s is overridden from %s to %s.",
                    assigned.getTypeKind().toString(),
                    exprType.toString()
                assigned,
            );
        }
    }
}

登入後複製

It's time to check the diagnostic rule in real cases. In the following example, our analyzer issues the warning only on the last line:

local a = "string";

if (true) then
  local a = 5;
end

a = 5;     <=

登入後複製

The analyzer warning: "M7000 Type of the variable a is overridden from Integer to String".

Lost local

Let's finish smoothly. The Lua plugin for VS Code has a diagnostic rule that detects global lowercase variables. This check can help detect the forgotten local identifiers. Let's implement the same diagnostic rule in our analyzer.

Here, just as before, we'll need to use the variable scope data that is obtained via the semantic analysis. We just find where the variable was declared. That's also where the variable is assigned a value. Then check its scope and name. If the variable is global and starts with a lowercase, the analyzer will warn. Easy-peasy.

Let's create a new class and override the visitCtAssignment method in it again. That way, we can look for the problematic global variables:

public class M7001 extends MunRule {
    @Override
    public void visitCtAssignment(CtAssignment assignment) {
        var variable = assignment.getAssigned();
        var firstLetter = variable.getVariableName().substring(0, 1);

        if (variable.getScope() instanceof CtGlobal && 
            !firstLetter.equals(firstLetter.toUpperCase())) {
            addRule("Global variable in lowercase initial.", variable);
        }
    }
}

登入後複製

Let's check the diagnostic rule work. It issues the warning: "M7001 Global variable in lowercase initial." It's issued on the second line in the code snippet below:

function sum_numbers(b, c)
    a = b + c;  <=
    return a;
end

local a = sum_numbers(10, 5);

登入後複製

Alright, we've written the diagnostic rules, we've done with the analyzer (and it even works). Breathe out. Now we can enjoy the fruits of our labors.

Viewing warnings

Above we've already shown the code that allows us to view warnings in a console or file. Here is how the process of working with the analyzer looks like in the console. We run the analyzer using the command:

java -jar mun-analyzer.jar -input "C:\munproject\test.lua"

And we get:

How to develop code analyzer in hours

However, first, the analyzer is a tool, and it should be user-friendly. Instead of working with the console, it'd be much more convenient to work with a plugin.

又到了借錢的時間了。 PVS-Studio 已經為許多不同的 IDE 提供了插件，例如 Visual Studio Code 和 IntelliJ IDEA。您可以查看警告並瀏覽它們。由於分析器報告具有標準化格式，因此我們可以簡單地借用 Java 分析器的 JSON 報告生成演算法。該演算法廣泛且枯燥，因此我們不會展示它。我們仍然必須從命令列運行它，但使用參數* -output "D:git/MunProject/report.json"*.

然後，我們可以在 IntelliJ IDEA 或 VS Code 中開啟報表並查看 Lua 分析器警告：

How to develop code analyzer in hours

甜甜的！現在我們可以將分析器用於其全部預期目的，而無需犧牲可用性。

星際探索

那麼，我們已經寫了成熟的分析器嗎？呃，不完全是。在這段漫長旅程的終點，我們擁有了最真實的飛行員，經歷了所有階段。然而，成長空間是巨大的：

核心增強功能：
- 增強資料流分析；
- 考慮控制流；
- 新增過程間和模組間分析。您可以在這裡閱讀我們如何用 C++ 做到這一點；
- 加入註解機制以幫助增強資料流分析和鴨子類型；
- 提供更多語意資料；
- 微調現有機制；
- 並且不要忘記更好的解析器。
診斷規則的增強：
- 增強現有的；
- 寫更多新的；
- 寫出超出幾十行的更複雜的診斷規則。
分析器可用性增強：
- 創建適當的插件支援；
- 整合到 CI/CD 中。
單元測試和回歸測試，用於檢查診斷規則在開發和修改過程中的表現。