コードアナライザーを数時間で開発する方法-＆＃＆チュートリアル-php.cn

静的分析は、開発者がコードの品質を管理するのに役立つ強力なツールです。 Java を使用して Lua 用の簡単なアナライザーを開発して、静的アナライザーの内部に何があるか見てみましょう。

How to develop code analyzer in hours

小さな序文

Java で Lua の静的アナライザーを作成します。

なぜ Lua なのか?構文はシンプルなので、詳細にこだわる必要はありません。 ~~特に JavaScript と比較すると、優れた言語でもあります~~。なぜ Java なのか? Java は私たちにとって万能な技術スタックを備えており、開発にはユーザーフレンドリーな言語でもあります。

いいえ :) ただし、この記事では、社内ハッカソンで本物のアナライザーパイロットを開発した方法について説明しています。つまり、タイトルは単なるクリックベイトではありません。実際には 48 時間かかりました。私たちのチームには 3 人の開発者がいたため、これをソロで試したい場合は、もう少し時間がかかることを覚悟してください。

これは単なるパイロットであるため、記事内ではアナライザーを「Mun」と呼びます。

アナライザーとは何ですか?

始める前に、アナライザーとは何かを理解し、作業範囲を計画することをお勧めします。全体として、これは明らかです。コードを取得して、ここに何か問題がある場合は文句を言いましょう。いったい何が必要なのでしょうか？私たちは静的解析の次の側面に興味があります:

レクサーとパーサー。ソースコードを取得して、使いやすいツリー (AST) に変換します。
AST (または抽象構文ツリー) は、プログラムのデータ構造をツリーとして表現する方法です。プログラムの構文に関するデータが含まれています。
意味論的なデータ。構文データだけでは分析には十分ではありません。したがって、セマンティック (意味) データをツリーに集約する追加のメカニズムが必要です。たとえば、変数スコープである可能性があります。
データフロー分析。より詳細な分析を行いたい場合は、プログラムノード内の変数値の予測を試みることができます。たとえば、ゼロ除算に関連するエラーを検出するのに役立ちます。

くどいようですが、これはアナライザーのコアだけです。ところで、コンパイラのフロントエンドにも同じものがあります。ただし、最も注意が必要な部分 (コード生成) はバックエンドに潜んでいます。

コアの計画は何ですか?範囲は次のとおりです:

コード内のエラーを検出する診断ルールを作成します;
検出されたエラーを収集し、レポートを発行します。
警告を表示するための独自のプラグインを作成します (オプション)。

確かに、48 時間では多すぎます:) いくつかのものは放棄し、いくつかは合理化し、いくつかは再利用する必要があります。記事を進めながら、そのようなケースについて検討していきます。

より深く理解するために、図式化してみましょう:

How to develop code analyzer in hours

始めるにはこれで十分です。静的分析について詳しくは、当社の Web サイトをご覧ください。

コア

レクサーとパーサー

理論

つまり、レクサーとパーサーはアナライザーの基礎であり、これらなしでは先に進むことはできません。正規表現以外のものが必要な場合。

レクサー を使用すると非常に簡単です。ソースコードをテキストとして扱うのは面倒です。したがって、それを中間表現に変換する、つまりトークンに分割する必要があります。とはいえ、レクサーは賢くなくてはなりません。最も困難で厄介なものはすべてパーサーに送られるため、これは必須です。

それでは、パーサーに移りましょう。受信したトークンを受け取り、それを計算して、AST を構築します。ここで簡単な背景を説明します。

言語には文法があり、それらを操作するさまざまな種類のパーサーがあります。自分たちで記述する場合は、文脈自由文法用の単純な LL(1) パーサーを作成した方がよいでしょう。前に述べたように、Lua は直感的な文法を備えているため、これで十分です。

LL は、入力文字列が左から右に読み取られ、それに対して左から右の出力が構築されることを意味します。一般に、現在のトークン (LL(0)) を確認するだけでは十分ではないため、k 個先のトークンを確認する必要がある場合があります。このようなパーサーは LL(k) と呼ばれます。

ただし、何も書くつもりはないので、これは必要ありません。なぜ？パーサーを作成する時間は 48 時間しかなく、特に開発したことがない場合は時間がありません。

選択されたアプローチ

レクサーとパーサーを作成するにはどのような代替手段があるでしょうか?発電機！これは、パーサーを生成するための入力として特別に記述された文法ファイルを与えることだけが必要なユーティリティのクラス全体です。

ANTLR v4 を選択しました。このツールは Java で書かれているため、非常に使いやすいです。長年の開発を経て、非常にうまくいき始めました。

ここにも問題が潜んでいます。パーサー用の文法ファイルの書き方を知る必要があります。幸いなことに、GitHub で既製のオプションを見つけるのは難しくないので、ここからそのまま採用します。

ANTLR を使用してプロジェクトを構成したら、抽象構文ツリーの構築に進みましょう。

抽象構文ツリー

ツリーといえば、ANTLR はコード分析を視覚化するように適切に設計されています。たとえば、階乗計算は次のとおりです:

function fact (n)
  if n == 0 then
    return 1
  else
    return n * fact(n-1)
  end
end

print("enter a number:")
a = io.read("*number")
print(fact(a))

ログイン後にコピー

次のツリーを取得できます:

How to develop code analyzer in hours

分類に関してはおそらく解析木に近いと思います。ここで停止して作業を開始できます。

しかし、それを行わずに AST に変換します。なぜ？ Java チームとしては、アナライザーのツリーと同様のツリーを操作する方が簡単です。 Java 開発について詳しくは、こちらをご覧ください。 Spoon のサンプルクラス階層は次のようになります:

How to develop code analyzer in hours

手作業の作業は後回しにして、コードを書く時間になりました。コード全体を完全には示していません。大きくて見苦しいため、意味がありません。私たちの考え方を少しでも理解していただくために、興味のある方のためにネタバレとしていくつかのコードスニペットを残しておきます。

ツリートップから処理を開始します:

public void enterStart_(LuaParser.Start_Context ctx) {
    _file = new CtFileImpl();
    _file.setBlocks(new ArrayList<>());
    for (var chunk : ctx.children) {
        if (chunk instanceof LuaParser.ChunkContext) {
            CtBlock block = getChunk((LuaParser.ChunkContext)chunk);
            if (block == null)
                continue;

            block.setParent(_file);
            _file.getBlocks().add(block);
        }
    }
}

private CtBlock getChunk(LuaParser.ChunkContext ctx) {
    for (var block : ctx.children) {
        if (block instanceof LuaParser.BlockContext) {
            return getBlock((LuaParser.BlockContext)block);
        }
    }
    return  null;
}

private CtBlock getBlock(LuaParser.BlockContext ctx) {
    CtBlock block = new CtBlockImpl();
    block.setLine(ctx.start.getLine());
    block.setColumn(ctx.start.getCharPositionInLine());
    block.setStatements(new ArrayList<>());

    for (var statement : ctx.children) {
        if (statement instanceof LuaParser.StatContext) {
            var statements = getStatement((LuaParser.StatContext)statement);
            for (var ctStatement : statements) {
                ctStatement.setParent(block);
                block.getStatements().add(ctStatement);
            }
        }
    }
    return block;
}

ログイン後にコピー

単純にツリーを上から下まで通過し、途中でツリーを構築します。遅かれ早かれ、ターミナルノードに到着します。関数パラメータは次のとおりです:

private List<CtParameter> parseParameters(
    LuaParser.NamelistContext ctx,
    CtElement parent
) {
    var parameters = new ArrayList<CtParameter>();
    for (var child : ctx.children) {
        if (Objects.equals(child.toString(), ","))
            continue;

        var parameter = new CtParameterImpl();
        parameter.setParameterName(child.toString());
        parameter.setParent(parent);
        parameters.add(parameter);
    }
    return parameters;
}

ログイン後にコピー

これもそれほど難しいことではないようです。あるオブジェクトを別のオブジェクトに変えるだけです。コードのリストを作成します。その仕組みが明確になることを願っています。

率直に言って、このアプローチは長期的には大きな利益にはつながりません。テキストをツリーに変換する代わりに、あるツリーを別のツリーに変換します。どちらのタスクもかなり面倒です。どのようなオプションがありますか?

レクサーとパーサーを最初から作成できます。それは良いことですが、期限やスキルに制限がある場合はそうではありません。
必要な AST をすぐに出力するように ANTLR を構成できます。本当だとは思えませんが、まだ ANTLR を勉強する必要があり、これもかなりの時間の無駄です。
市場投入までの時間を短縮できるソリューション。手持ちのものを使って作業し、結果として得られるツリーをターゲットツリーに変換する必要があります。良くはないですが、耐えられます。
私たちはまったく変換しません。チームに Java アナライザー開発者がいなかったら、そうしていたでしょう。

AST のコード分析の例を提示しなかった場合、このセクションは明らかに不完全になります。スポイラーの下で、同じ階乗計算の美しい木の模様を調べることができます。

CtGlobal:
  CtFile:
    CtFunction: func
      Parameters:
        CtParameter: n
      CtBlock:
        CtIf:
          Condition:
            CtBinaryOperator: Equals
              Left:
                CtVariableRead: n
              Right:
                CtLiteral: 0
          Then block
            CtBlock:
              CtReturn:
                CtLiteral: 1    
          Else block
            CtBlock:
              CtReturn:
                CtBinaryOperator: Multiplication
                  Left:
                    CtVariableRead: n
                  Right:
                    CtInvocation: fact
                      Arguments:
                        CtBinaryOperator: Minus
                          Left:
                            CtVariableRead: n
                          Right:
                            CtLiteral: 1
    CtInvocation: print
      Arguments:
        CtLiteral: "enter a number:"
    CtAssignment:
      Left:
        CtVariableWrite: a
      Right:
        CtInvocation:
          CtFieldRead: 
            Target: io
            Field: read
          Arguments:
            CtParameter: "*number"
    CtInvocation: print
      Arguments:
        CtInvocation: fact
          Arguments:
            CtVariableRead: a

ログイン後にコピー

ここで用語について簡単にコメントして終わりにしましょう。私たちのエンティティをツリートランスレータと呼ぶ方が適切ですが、今はパーサーと呼ぶことにしましょう。

ビジター

次に、分析を実行し、診断ルールの構築に使用する機能を開発します。樹木横断です。全体として、ツリー反復の実装方法を理解するのは簡単です。ただし、ツリーノードを使用していくつかの便利な作業を行う必要があります。来訪者の模様がステージに登場します。

古典的な GoF 本に載っていたあの魅力的なパターンを覚えていますか?このパターンにはクールな実装が含まれていますが、かなり曖昧な使用シナリオが記載されています。そこで、実際の状況でどのように使用されるかを示すベンチマークケースを用意しました。記事を簡潔にするために、書籍内でどのように実装されるかについては説明しませんが、アナライザーでどのように実行するかを示します。

簡単なツリートラバースから始めましょう。 CtScanner クラスを定義し、単一のアイテムとそのコレクションに対して 2 つの scan メソッドを追加します。

public <T extends CtElement> void scan(T element) {
    if (element != null) {
        element.accept(this);
    }
}

public <T extends CtElement> void scan(List<T> elements) {
    if (elements != null) {
        for (var element : elements) {
            scan(element);
        }
    }
}

ログイン後にコピー

CtElement からの accept が見えますか?この例では、CtVisitable インターフェースを継承するクラスはすべて、void accept(CtAbstractVisitor Visitor) メソッドを実装します。 CtAbstractVisitor については後ほど説明します。ここでは、CtScanner がそこから継承されていることを知るだけで十分です。

CtAssignment での accept は次のようになります:

@Override
public void accept(CtAbstractVisitor visitor){
    visitor.visitCtAssignment(this);
}

ログイン後にコピー

はい、それは簡単です。ノードはビジターでメソッドを呼び出します。 CtScanner には、ツリーノードの各クラスのメソッドがあるはずです。

@Override
public void visitCtIf(CtIf ctIf) {
    scan(ctIf.getCondition());
    scan((CtStatement) ctIf.getThenStatement());
    scan((CtStatement) ctIf.getElseStatement());
}

@Override
public <T> void visitCtLiteral(CtLiteral<T> literal) {
}
@Override
public void visitCtStatement(CtStatement statement) {
}

// ....

ログイン後にコピー

さて、CtScanner から抽出されたインターフェイスである CtAbstractVisitor に戻りましょう。これにはツリーノードにアクセスするためのメソッドが含まれていますが、これらのメソッドのみが含まれており、scan メソッドは含まれていません。ビジターメソッドの実装では、将来のオーバーロードに備えてプレースホルダーを残すか (終端ノードの場合)、それに沿って再帰降下を実行してツリーノードの展開を続けます。続行するために知っておく必要があるのはこれだけです。

Semantic parsing

Introduction

It'd seem that's all with the core. For example, now our analyzer can catch simple errors like the variable self-assignment. We call it the signature analysis. To provide more advanced analysis, we'd like to obtain more data about what's going on in the code. The parser has done its job, it's time to create new entities.

So far, we've completely followed the way of compilers regarding of the analyzer structure, but now our path diverges. If Lua had a compiler, it'd start analysis to ensure that the code is syntactically correct. Although our tools will continue to overlap in features.

For our needs, let's create SemanticResolver for two purposes:

define the variable scope;
evaluate the variable type based on the duck typing.

However, we'll call it from the parser along with its traversal. No decorator this time.

To keep the app structure seamless, we'll contain all the semantic data in the same AST nodes. First, let's define the necessary properties in the CtVariableAccess interface:

CtBlock getScope();
void setScope(CtBlock block);

TypeKind getTypeKind();
void setTypeKind(TypeKind type);

ログイン後にコピー

Variable scope

Let's start with scope; this will be our key tool for defining the variable along with its name. First, we define the variable entity inside SemanticResolver. To keep it short, we'll show you only the interface, but the gist should be clear:

public static class Variable {
    public Variable(String identifier);
    public String getIdentifier();
    public CtBlock getScope();
    public void setScope(CtBlock block);
    public void setType(TypeKind type);
    public TypeKind getType();
    // Methods use only identifier
    @Override
    public boolean equals(Object o);
    @Override
    public int hashCode();
}

ログイン後にコピー

Let's also further define the stack for variable scopes:

private final Stack<Pair<CtBlock, HashSet<Variable>>> stack = new Stack<>();
private final CtGlobal global;
public SemanticResolver(CtGlobal global) {
    pushStack(global);
    this.global = stack.peek();
}

public void pushStack(CtBlock block) {
    stack.push(Pair.of(block, new HashSet<>()));
}
public void popStack() {
    stack.pop();
}

ログイン後にコピー

The stack entry consists of a tuple of scope and variables set registered there. The stack operation is mundane. Here's how it works in the parser:

private CtBlock getBlock(LuaParser.BlockContext ctx) {
    CtBlock block = new CtBlockImpl();
    resolver.pushStack(block);

    // ....

    resolver.popStack();
    return block;
}

ログイン後にコピー

We have to register the variables somehow. If a variable is local, it's simple—let's take the current scope and pass the variable there:

public CtBlock registerLocal(Variable variable) {
    var scope = stack.pop();
    variable.setScope(scope.getLeft());
    scope.getRight().add(variable);
    stack.push(scope);
    return scope.getLeft();
}

ログイン後にコピー

If the local keyword isn't used, the variable will be either global or be declared somewhere above. So, we first go through the stack and double-check if it exists:

public CtBlock registerUndefined(Variable variable) {
    var pair = lookupPair(variable);
    pair.getRight().add(variable);
    return pair.getLeft();
}

public Pair<CtBlock, HashSet<Variable>> lookupPair(Variable variable) {
    var buf = new Stack<Pair<CtBlock, HashSet<Variable>>>();
    Pair<CtBlock, HashSet<Variable>> res = null;
    while (!stack.isEmpty()) {
        var scope = stack.pop();
        buf.push(scope);
        if (scope.getRight().contains(variable)) {
            res = scope;
            break;
        }
    }

    while (!buf.isEmpty()) {
        stack.push(buf.pop());
    }

    if (res == null) {
        return global;
    }

    return res;
}

ログイン後にコピー

We can set the scope to variables when we have the entry:

private CtVariableWrite getVariableWriteInternal(
        ParseTree ctx,
        boolean isLocal
) {
    var node = new CtVariableWriteImpl();
    node.setVariableName(ctx.getChild(0).toString());
    CtBlock scope;
    if (isLocal) {
        scope = resolver.registerLocal(
             new SemanticResolver.Variable(node.getVariableName()));
    } else {
        scope = resolver.registerUndefined(
            new SemanticResolver.Variable(node.getVariableName()));
    }
    node.setScope(scope);
    return node;
}

ログイン後にコピー

And defining it for readings:

private CtExpression getExpression(LuaParser.ExpContext ctx) {
    // ....
    if (child instanceof LuaParser.PrefixexpContext) {
        // ....
        var scope = resolver.lookupScope(
            new SemanticResolver.Variable(variableRead.getVariableName())
        );
        variableRead.setScope(scope);

        return variableRead;
    }
    // ....
}

ログイン後にコピー

We won't show the lookupScope code. It's a single-line wrapper over lookupPair, which we can see above. Here, we can end with the variable scope. We'll check the mechanism in the diagnostic rule on a separate section. Now we continue working on the semantic parsing. Now, let's move to the variable type determination.

Duck typing

How to determine the variable type? Indeed, we obtain them from the literals. Let's further define the type and the enumeration for them:

public interface CtLiteral<T> extends CtExpression, CtVisitable {
  // ....
  void setTypeKind(TypeKind kind);
  TypeKind getTypeKind();
}

ログイン後にコピー

public enum TypeKind {
    Undefined,
    Number,
    String,
    Boolean,
    Nil
}

ログイン後にコピー

Thus, the data type can be numeric, string, logical, and nil. However, it'll be undefined by default. The split of undefined and nil may seem far-fetched, but it's okay for the pilot.

We store the literal type only in the tree node, doing it from the parser:

private <T> CtLiteralImpl<T> createLiteral(
    // ....
    TypeKind type,
) {
    // ....

    literal.setTypeKind(type);
    return literal;
}

ログイン後にコピー

However, the variable type will be both in the tree and in SemanticResolver. So, we can request it during further traversal and the AST building:

private ArrayList<CtAssignment> parseAssignments(LuaParser.StatContext ctx) {
    // ....
    for (int i = 0; i < variables.size(); i++) {
        var assignment = new CtAssignmentImpl();
        var variable = variables.get(i);

        // ....


        variable.setTypeKind(resolver.lookupType(variable.getVariableName()));
        resolver.setType(
            variable.getVariableName(),
            variable.getScope(),
            SemanticResolver.evaluateExpressionType(expression)
        );
    }
    return assignments;
}

ログイン後にコピー

There is no error in the operator precedence. Let the stored variable have the type from its past assignment. It'll facilitate our work in the future. As for the methods used here, there's nothing incredible about the lookupType implementation—it's basically the same as lookupPair. There's nothing complex about setType:

public void setType(String variable, CtBlock scope, TypeKind type) {
    var opt = stack.stream()
             .filter(x -> Objects.equals(x.getLeft(), scope))
             .findFirst();
    if (opt.isPresent()) {
        var pair = opt.get();
        var newVar = new Variable(variable);
        var meta = pair.getRight()
                       .stream()
                       .filter(x -> x.equals(newVar))
                       .findFirst();
        meta.ifPresent(value -> value.setType(type));
    }
}

ログイン後にコピー

However, evaluateExpressionType is trickier. Computing variable types in dynamic languages can be a bit of a hassle. Just look at the jokes about JavaScript and string concatenation. However, firstly, Lua has a separate operator '..', and secondly, we're trying to keep the process simple. So, we'll only determine whether all the operands are the same type. We'll use the familiar CtScanner.

public static TypeKind evaluateExpressionType(CtExpression expression) {
    Mutable<TypeKind> type = new MutableObject<>(null);
    var typeEvaluator = new CtScanner() {
        private boolean stop = false;
        @Override
        public void scan(CtElement el) {
            if (stop) { return; }

            if (el instanceof CtVariableRead || el instanceof CtLiteral<?>) {
                var newType = el instanceof CtVariableRead
                              ? ((CtVariableRead) el).getTypeKind()
                              : ((CtLiteral<?>) el).getTypeKind();

                if (newType.equals(TypeKind.Undefined)) {
                    type.setValue(TypeKind.Undefined);
                    stop = true;
                    return;
                } else if (type.getValue() == null) {
                    type.setValue(newType);
                } else if (!type.getValue().equals(newType)) {
                    type.setValue(TypeKind.Undefined);
                    stop = true;
                    return;
                }
            }
            super.scan(el);
        }
    };
    typeEvaluator.scan(expression);
    return type.getValue();
}

ログイン後にコピー

In parseAssignments, we've set the assignment type to the (CtVariableWrite) variable but forgot about reading (CtVariableRead). Let's fix it:

private CtExpression getExpression(LuaParser.ExpContext ctx) {
    // ....
    if (child instanceof LuaParser.PrefixexpContext) {
        // ....
        variableRead.setTypeKind(
            resolver.lookupType(variableRead.getVariableName())
        );
        var scope = resolver.lookupScope(
            new SemanticResolver.Variable(variableRead.getVariableName()));
        variableRead.setScope(scope);

        return variableRead;
    }
    // ....
}

ログイン後にコピー

We've completed the semantic analysis and almost ready to start searching for bugs.

Data-flow analysis

Inside structure

First, we make two quick stops. While the topic of the data-flow analysis deserves a series of articles, it'd be wrong to skip over it. Here, we won't dive deep into theory but just try to memorize the values set by literals.

First, let's fall into the sin of self-copying and define the variable entity for DataFlow again but in a simpler way. Again, we'll only show you the interface:

private static class Variable {
    private Variable(String identifier, CtBlock scope);
    // Methods use identifier and scope
    @Override
    public boolean equals(Object o);
    @Override
    public int hashCode();
}

ログイン後にコピー

Here's the rest of the class content:

public class DataFlow {
    private static class Variable {
        // ....
    }
    Map<Variable, Object> variableCache = new HashMap<>();

    public void scanDataFlow(CtElement element) {
        if (element instanceof CtAssignment) {
            CtAssignment variableWrite = (CtAssignment) element;
            if (variableWrite.getAssignment() instanceof CtLiteral<?>) {
                var assigned = variableWrite.getAssigned();
                var variable = new Variable(
                    assigned.getVariableName(),
                    assigned.getScope()
                );
                variableCache.put(
                    variable, 
                    getValue(variableWrite.getAssignment())
                );
            }
        }
    }

    public Object getValue(CtExpression expression) {
        if (expression instanceof CtVariableRead) {
            CtVariableRead variableRead = (CtVariableRead) expression;
            var variable = new Variable(
                variableRead.getVariableName(),
                variableRead.getScope()
            );
            return variableCache.getOrDefault(variable, null);
        } else if (expression instanceof CtLiteral<?>) {
            return ((CtLiteral<?>) expression).getValue();
        }
        return null;
    }
}

ログイン後にコピー

It's quite simple: in scanDataFlow, we associate the value with the variable, and in getValue, we extract that value for the set node. Everything is simple because we don't factor in branching, loops, or even expressions. Why don't we factor them in? Branching is the very topic that deserves its own series of articles. What about expressions? Well, we didn't make it in two days. Given what we've achieved in the last two days, we think this task is feasible. We just postpone it as a homework.

That's all. It's clear that such a solution is far from a real product, but we have laid some foundation. Then we can either try to enhance the code—and we'll introduce data-flow analysis via AST. Or we can redo everything properly and build the control-flow graph.

We've done the class implementation, but we haven't discussed how to use it. We've written the class, but what shall we do with it? Let's talk about it in the following section. Now we just say that DataFlow works right before calling the diagnostic rules. Those are called when traversing the finished AST. Thus, the rules will have access to the current variable values. This is called Environment; you can see it in your debugger.

Walker

Welcome to the last section regarding the core. We've already had the AST that is full of semantic data, as well as the data-flow analysis that is just waiting to be run. It's a good time to put it all together and set the stage for our diagnostic rules.

How is the analysis performed? It's simple: we get on the topmost tree node, and then we start recursive traversal. That is, we need something that will traverse the tree. We have CtScanner for that. Based on it, we define MunInvoker:

public class MunInvoker extends CtScanner {
    private final List<MunRule> rules = new ArrayList<>();
    private final Analyzer analyzer;

    public MunInvoker(Analyzer analyzer) {
        this.analyzer = analyzer;
        rules.add(new M6005(analyzer));
        rules.add(new M6020(analyzer));
        rules.add(new M7000(analyzer));
        rules.add(new M7001(analyzer));
    }

    @Override
    public <T extends CtElement> void scan(T element) {
        if (element != null) {
            analyzer.getDataFlow().scanDataFlow(element);
            rules.forEach(element::accept);
            super.scan(element);
        }
    }
}

ログイン後にコピー

You can notice a few unknown things in the code:

the Analyzer class. It encapsulates the whole analysis process and contains shared resources that need to be accessed within the rules. In our case, this is the DataFlow instance. We'll get back to Analyzer later;
four confusing classes that are added to rules. We'll talk about them in the next section, so don't panic. A little spoiler for you :)

Otherwise, the class operation shouldn't raise any questions. Each time we enter any tree node, the analyzer rule is called, and the variable values are evaluated right before it. Next, the traversal continues in line with to the CtScanner algorithm.

Analysis

Preparing for writing diagnostic rules

Rule class

So, we have the analyzer core prototype—it's a good time to start analyzing something.

The base for our rules is ready—it's the CtAbstractVisitor class. The analysis goes as follows: the rule overloads a few visitors and analyzes the data contained in the AST nodes. Let's extend CtAbstractVisitor with the abstract MunRule class, which we use to create rules. In this class, we also define the addRule method that generates warnings.

Speaking of warnings: what data do they need? First is a warning message to demonstrate users what they may have been mistaken about. Second, the user needs to know where and what the analyzer warns. So, let's add data about the file where the analyzer has detected the troublesome code block and the location of this code fragment.

Here's what the MunRule class looks like:

public abstract class MunRule extends CtAbstractVisitor {
    private Analyzer analyzer;
    public void MunRule(Analyzer analyzer) {
        this.analyzer = analyzer;
    }
    protected Analyzer getAnalyzer() {
        return analyzer;
    }

    protected void addRule(String message, CtElement element) {
        var warning = new Warning();
        warning.message = message;

        WarningPosition pos = new WarningPosition(
            Analyzer.getFile(), 
            element.getLine(), 
            element.getColumn() + 1
        );
        warning.positions.add(pos);
        analyzer.addWarning(warning);
    }

    public DataFlow getDataFlow() {
        return analyzer.getDataFlow();
    }
}

ログイン後にコピー

The WarningPosition and Warning classes are just data stores, so we won't list them. We'll talk about addWarning now.

Merging it together

The last thing to prepare is how we're going to view the diagnostic rules. To do this, we combine all our features together, using the already mentioned Analyzer class. Actually, here it is:

public class Analyzer {
    private DataFlow dataFlow = new DataFlow();

    public DataFlow getDataFlow() {
        return dataFlow;
    }

    public CtElement getAst(String pathToFile) throws IOException {
        InputStream inputStream = new FileInputStream(pathToFile);
        Lexer lexer = new LuaLexer(CharStreams.fromStream(inputStream));

        ParseTreeWalker walker = new ParseTreeWalker();
        var listener = new LuaAstParser();
        walker.walk(listener, new LuaParser(
            new CommonTokenStream(lexer)
        ).start_());
        return listener.getFile();
    }

    protected void addWarning(Warning warning) {
        Main.logger.info(
            "WARNING: " + warning.code + " "
            + warning.message + " ("
            + warning.positions.get(0).line + ", "
            + warning.positions.get(0).column + ")");
      }      

    public MunInvoker getMunInvoker() {
        return new MunInvoker(this);
    }

    public void analyze(String pathToFile) {
        try {
            var top = getAst(pathToFile);
            var invoker = getMunInvoker();
            invoker.scan(top);
        }
        catch (IOException ex) {
            Main.logger.error("IO error: " + ex.getMessage());
        }
    }
}

ログイン後にコピー

To explain how it works, we'll give you an example of the whole analysis process:

In the getAst method, we build our AST using the scheme lexer —> parser —> tree translator;
Then we call MunIvoker, which traverses the tree and calls our diagnostic rules along with the data-flow analysis;
If necessary, the rules call the Analyzer class to get the DataFlow instance;
They call addWarning when the analyzer spots suspicious code fragment. That one just outputs the message to the log.

We've got the prep work—it's time to start writing diagnostic rules.

Writing diagnostic rules

Assigning a variable to itself

We've decided to start writing the rules with a simple one: PVS-Studio has the Java diagnostic rule, V6005, where the analyzer check if a variable is assigned to itself. It can simply be copied and slightly adapted to our tree. Since our analyzer is called Mun, we start the numbers of our diagnostic rules with M. Let's create the M6005 class, extending MunRule, and override the visitCtAssignment method in it. The following check will be located in the method:

public class M6005 extends MunRule {
    private void addRule(CtVariableAccess variable) {
        addRule("The variable is assigned to itself.", variable);
    }
    @Override
    public void visitCtAssignment(CtAssignment assignment) {
        if (RulesUtils.equals(assignment.getAssigned(), 
            assignment.getAssignment())) {
            addRule(assignment.getAssigned());
        }
    }
}

ログイン後にコピー

The used RulesUtils.equals method is a wrapper and overload for another equals method that checks the name and scope:

public static boolean equals(CtVariableAccess left, CtVariableAccess right) {
    return left.getVariableName().equals(right.getVariableName())
           && left.getScope().equals(right.getScope());
}

ログイン後にコピー

We need to check the scope, because the next code fragment isn't an assignment of the variable to itself:

local a = 5;
begin
    local a = a;
end

ログイン後にコピー

Now, we can test the diagnostic rule on some simple code and see if it works. The following example will issue the warning on the marked line: "M6005 The variable is assigned to itself":

local a = 5;
local b = 3;

if (b > a) then
    a = a;      <=
end

ログイン後にコピー

Zero division

Well, we've warmed up, so let's move on. The analyzer has already had the primitive data-flow analysis (DataFlow) that we can and should use. Again, let's look at one of our existing diagnostic rules, V6020, where the analyzer checks for the zero division. We try to adapt it for our analyzer. The divisor can be either the variable as zero or a literal as zero. We need to access the cache of variables to check their value.

Here's the simple implementation of such a diagnostic rule:

public class M6020 extends MunRule {
  private void addWarning(CtElement expression, String opText) {
    addRule(String.format(
            "%s by zero. Denominator '%s' == 0.",
            opText, expression instanceof CtLiteral
                ? ((CtLiteral) expression).getValue()
                : ((CtVariableRead) expression).getVariableName()
        ),
        expression
    );
  }

  @Override
  public void visitCtBinaryOperator(CtBinaryOperator operator) {
    BinaryOperatorKind opKind = operator.getKind();
    if (opKind != BinaryOperatorKind.DIV && opKind != BinaryOperatorKind.MOD) {
      return;
    }

    apply(operator.getRightHandOperand(), opKind == BinaryOperatorKind.MOD);
  }


  private void apply(CtExpression expr, boolean isMod) {
    Object variable = getDataFlow().getValue(expr);
    if (variable instanceof Integer) {
      if ((Integer) variable == 0) {
        String opText = isMod ? "Mod" : "Divide";
        addWarning(expr, opText);
      }
    }
  }
}

ログイン後にコピー

We can see that the diagnostic rule works on simple examples and issues the following warning: "M6020 Divide by zero. Denominator 'b' == 0" on these lines:

local a = 5;
local b = 0;
local c = a / b;   <=
local d = a / 0;   <=

ログイン後にコピー

If you have made the expression evaluator as your homework, you may try the diagnostic rule on this code:

local b = 7;
local b = b - 7;
local c = a / b;

ログイン後にコピー

Overwritten types

Since we're writing the Lua analyzer, we need to write diagnostic rules for this language. Let's start simple.

Lua is a dynamic scripting language. Let's use this feature and write the diagnostic rule that allows us to catch overwritten types.

We'll also need to pick a new number for the diagnostic rule. Earlier we just copied them from the Java analyzer. Now, it seems like it's time to start a new thousand—the seventh. Who knows what diagnostic rule it'll be in PVS-Studio, but at the time of writing this article, it will be Lua.

Knowing about types, we facilitate our case: we need to check that the left and right assignments are of different types. Note that we ignore Undefined for the left part and Nil for the right part. The code looks like this:

public class M7000 extends MunRule {
    @Override
    public void visitCtAssignment(CtAssignment assignment) {
        var assigned = assignment.getAssigned();
        var exprType = SemanticResolver.evaluateExpressionType(
            assignment.getAssignment());
        if (assigned.getTypeKind().equals(TypeKind.Undefined)
            || exprType.equals(TypeKind.Nil)
        ) {
            return;
        }

        if (!assigned.getTypeKind().equals(exprType)) {
            addRule(
                String.format(
                    "Type of the variable %s is overridden from %s to %s.",
                    assigned.getTypeKind().toString(),
                    exprType.toString()
                assigned,
            );
        }
    }
}

ログイン後にコピー

It's time to check the diagnostic rule in real cases. In the following example, our analyzer issues the warning only on the last line:

local a = "string";

if (true) then
  local a = 5;
end

a = 5;     <=

ログイン後にコピー

The analyzer warning: "M7000 Type of the variable a is overridden from Integer to String".

Lost local

Let's finish smoothly. The Lua plugin for VS Code has a diagnostic rule that detects global lowercase variables. This check can help detect the forgotten local identifiers. Let's implement the same diagnostic rule in our analyzer.

Here, just as before, we'll need to use the variable scope data that is obtained via the semantic analysis. We just find where the variable was declared. That's also where the variable is assigned a value. Then check its scope and name. If the variable is global and starts with a lowercase, the analyzer will warn. Easy-peasy.

Let's create a new class and override the visitCtAssignment method in it again. That way, we can look for the problematic global variables:

public class M7001 extends MunRule {
    @Override
    public void visitCtAssignment(CtAssignment assignment) {
        var variable = assignment.getAssigned();
        var firstLetter = variable.getVariableName().substring(0, 1);

        if (variable.getScope() instanceof CtGlobal && 
            !firstLetter.equals(firstLetter.toUpperCase())) {
            addRule("Global variable in lowercase initial.", variable);
        }
    }
}

ログイン後にコピー

Let's check the diagnostic rule work. It issues the warning: "M7001 Global variable in lowercase initial." It's issued on the second line in the code snippet below:

function sum_numbers(b, c)
    a = b + c;  <=
    return a;
end

local a = sum_numbers(10, 5);

ログイン後にコピー

Alright, we've written the diagnostic rules, we've done with the analyzer (and it even works). Breathe out. Now we can enjoy the fruits of our labors.

Viewing warnings

Above we've already shown the code that allows us to view warnings in a console or file. Here is how the process of working with the analyzer looks like in the console. We run the analyzer using the command:

java -jar mun-analyzer.jar -input "C:\munproject\test.lua"

And we get:

How to develop code analyzer in hours

However, first, the analyzer is a tool, and it should be user-friendly. Instead of working with the console, it'd be much more convenient to work with a plugin.

また借りる時が来ました。 PVS-Studio には、Visual Studio Code や IntelliJ IDEA など、さまざまな IDE 用のプラグインがすでにあります。警告を表示し、警告を参照することができます。アナライザーレポートの形式は標準化されているため、Java アナライザーから JSON レポート生成アルゴリズムを借用するだけで済みます。アルゴリズムは広範で退屈なため、ここでは示しません。コマンドラインから実行する必要がありますが、引数* -output "D:git/MunProject/report.json"* を使用します。

次に、IntelliJ IDEA または VS Code でレポートを開いて、Lua アナライザーの警告を確認します。

How to develop code analyzer in hours

甘い！これで、使いやすさを損なうことなく、アナライザーを本来の目的に沿って使用できるようになりました。

アド・アストラ

それで、本格的なアナライザーを作成できたでしょうか?ええと、正確には違います。この長い旅の終わりには、すべての段階を通過する最も本物のパイロットがいます。ただし、成長の余地は非常に大きいです:

コアの機能強化:
- データフロー分析を強化します。
- 制御フローを考慮してください。
- プロシージャ間およびモジュール間の分析を追加します。ここで C++ でどのように実行したかを読むことができます;
- データフロー分析とダックタイピングの強化に役立つ注釈メカニズムを追加します。
- より多くのセマンティックデータを提供します;
- 既存のメカニズムを微調整する;
- そして、より優れたパーサーについても忘れないでください。
診断ルールによる機能強化:
- 既存のものを強化します;
- さらに新しいものを書きます;
- 数十行を超えるさらに複雑な診断ルールを記述します。
アナライザーの使いやすさの強化:
- 適切なプラグインサポートを作成します;
- CI/CD に統合します。
開発および変更における診断ルールのパフォーマンスをチェックするための単体テストと回帰テスト。

その他にもたくさん。つまり、パイロットから本格的なツールに至るまでの道のりはかなり険しいのです。そのため、PVS-Studio は、新しい方向ではなく、既存の方向、つまり C#、C、C++、Java に焦点を当てています。これらの言語のいずれかで執筆している場合は、アナライザーを試してみてください。