몇 시간 만에 코드 분석기를 개발하는 방법-java지도 시간-php.cn

집

Java

java지도 시간

몇 시간 만에 코드 분석기를 개발하는 방법

WBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWB

Aug 14, 2024 pm 06:43 PM

정적 분석은 개발자가 코드 품질을 제어하는 데 도움이 되는 강력한 도구입니다. Java를 사용하여 간단한 Lua용 분석기를 개발하고 정적 분석기 내부에 무엇이 있는지 살펴보겠습니다.

How to develop code analyzer in hours

작은 서문

Lua용 정적 분석기를 Java로 작성하겠습니다.

왜 루아인가? 구문은 간단하므로 세부 사항에 대해 설명할 필요가 없습니다. ~~특히 JavaScript와 비교하면~~ 좋은 언어이기도 합니다. 왜 자바인가? Java는 우리에게 포괄적인 기술 스택을 갖추고 있으며, Java는 개발을 위한 사용자 친화적인 언어이기도 합니다.

아니요 :) 하지만 기사에서는 사내 해커톤에서 실제 분석기 파일럿을 개발한 방법에 대해 설명합니다. 따라서 제목은 단순한 클릭베이트가 아닙니다. 실제로 48시간이 있었습니다. 우리 팀에는 3명의 개발자가 있으므로 혼자서 해보고 싶다면 조금 더 시간을 할애할 준비를 하세요.

이것은 파일럿이므로 기사에서는 분석기를 "문"으로 지칭하겠습니다.

분석기란 무엇입니까?

시작하기 전에 분석기가 무엇인지 파악하고 작업 범위를 계획하는 것이 좋습니다. 대체로 명확합니다. 코드를 잡고 여기에 문제가 있으면 투덜대십시오. 우리에게 정확히 무엇이 필요한가요? 우리는 정적 분석의 다음 측면에 관심이 있습니다.

렉서와 파서. 소스 코드를 가져와 사용하기 쉬운 트리(AST)로 변환합니다.
AST(또는 추상 구문 트리)는 프로그램의 데이터 구조를 트리로 표현하는 방법입니다. 프로그램 구문에 대한 데이터가 포함되어 있습니다.
의미론적 데이터. 구문 데이터만으로는 분석에 충분하지 않습니다. 따라서 의미론적(의미) 데이터를 트리에 집계하는 추가 메커니즘이 필요합니다. 예를 들어 변수 범위가 될 수 있습니다.
데이터 흐름 분석. 심층 분석을 수행하려면 프로그램 노드의 변수 값을 예측해 볼 수 있습니다. 예를 들어 0 나눗셈과 관련된 오류를 잡는 데 도움이 됩니다.

지루하게 들리겠지만 이것은 분석기 핵심일 뿐입니다. 그런데 컴파일러의 프런트엔드에도 같은 것이 있습니다. 그러나 가장 까다로운 부분(코드 생성)은 백엔드에 숨어 있습니다.

코어의 계획은 무엇인가요? 범위는 다음과 같습니다.

코드 오류를 감지하기 위한 진단 규칙 작성
발견된 오류를 수집하고 보고서를 발행합니다.
경고를 보려면 자체 플러그인을 만드세요(선택 사항).

사실 48시간은 너무 많아요 :) 어떤 것들은 포기해야 하고, 어떤 것들은 간소화되어야 하고, 어떤 것들은 재사용되어야 합니다. 기사를 진행하면서 그러한 경우를 고려해 보겠습니다.

더 나은 이해를 돕기 위해 이를 도식화해 보겠습니다.

How to develop code analyzer in hours

시작하기에 충분합니다. 당사 웹사이트에서 정적 분석에 대해 자세히 알아볼 수 있습니다.

핵심

렉서와 파서

이론

따라서 렉서와 파서는 분석기의 기초이므로 이들 없이는 더 이상 나아갈 수 없습니다. 정규 표현식 이상의 것을 원한다면

렉서를 사용하면 매우 간단합니다. 소스 코드를 텍스트로 처리하는 것은 번거롭습니다. 따라서 이를 중간 표현으로 변환해야 합니다. 즉, 토큰으로 분할해야 합니다. 즉, 어휘 분석기는 똑똑하지 않아야 합니다. 가장 어렵고 지저분한 모든 작업이 파서에 전달되기 때문에 이는 필수입니다.

그럼 파서로 이동해 보겠습니다. 들어오는 토큰을 가져와서 파악하고 AST를 구축합니다. 간단한 배경 설명은 다음과 같습니다.

언어에는 문법이 있으며 이를 사용할 수 있는 다양한 유형의 파서가 있습니다. 우리가 직접 작성했다면 문맥 자유 문법을 위한 간단한 LL(1) 구문 분석기를 작성하는 것이 더 나을 것입니다. 앞서 말했듯이 루아는 직관적인 문법을 갖고 있으니 이 정도면 충분할 것 같습니다.

LL은 입력 문자열을 왼쪽에서 오른쪽으로 읽고 이에 대해 왼쪽에서 오른쪽으로 출력이 구성됨을 의미합니다. 일반적으로 현재 토큰 (LL(0)) 을 보는 것만으로는 충분하지 않으므로 앞으로 k 개의 토큰을 살펴봐야 할 수도 있습니다. 이러한 파서를 LL(k)라고 합니다.

하지만 아무것도 쓰지 않을 것이기 때문에 필요하지 않습니다. 왜? 파서를 생성할 시간은 48시간밖에 없으며, 특히 파서를 개발한 적이 없는 경우에는 더욱 그렇습니다.

선택된 접근법

렉서와 파서를 작성하려면 어떤 대안이 있어야 합니까? 발전기! 이것은 파서를 생성하기 위해 특별히 설명된 문법 파일을 입력으로 제공하기만 하면 되는 전체 유틸리티 클래스입니다.

우리는 ANTLR v4를 선택했습니다. 이 도구는 Java로도 작성되어 있어 사용하기 매우 쉽습니다. 수년간의 개발을 통해 매우 좋은 성과를 거두기 시작했습니다.

여기에도 문제가 숨어 있습니다. 파서용 문법 파일을 작성하는 방법을 알아야 합니다. 다행히 GitHub에서 미리 만들어진 옵션을 찾는 것이 어렵지 않으니 여기서는 그냥 가져가겠습니다.

ANTLR로 프로젝트를 구성한 후 추상 구문 트리 구축으로 넘어가겠습니다.

추상 구문 트리

나무 이야기: ANTLR은 코드 분석을 시각화하도록 잘 설계되었습니다. 예를 들어 계승 계산은 다음과 같습니다.

function fact (n)
  if n == 0 then
    return 1
  else
    return n * fact(n-1)
  end
end

print("enter a number:")
a = io.read("*number")
print(fact(a))

로그인 후 복사

다음 트리를 얻을 수 있습니다:

How to develop code analyzer in hours

분류 측면에서는 파스 트리에 더 가깝다고 생각합니다. 여기서 멈추고 작업을 시작해 보세요.

하지만 그렇게 하지 않고 AST로 변환하겠습니다. 왜? Java 팀으로서 우리 분석기의 트리와 유사한 트리로 작업하는 것이 더 쉬울 것입니다. 여기에서 Java 개발에 대한 자세한 내용을 읽을 수 있습니다. Spoon의 샘플 클래스 계층 구조는 다음과 같습니다.

How to develop code analyzer in hours

자, 이제 수작업을 미뤄두었다면 이제 코드를 작성할 차례입니다. 전체 코드를 완전히 보여주지는 않습니다. 코드가 크고 보기 흉하기 때문에 의미가 없습니다. 우리 사고 방식에 대한 약간의 통찰력을 제공하기 위해 관심 있는 분들을 위해 스포일러 아래 몇 가지 코드 조각을 남겨두겠습니다.

트리 상단부터 처리를 시작합니다.

public void enterStart_(LuaParser.Start_Context ctx) {
    _file = new CtFileImpl();
    _file.setBlocks(new ArrayList<>());
    for (var chunk : ctx.children) {
        if (chunk instanceof LuaParser.ChunkContext) {
            CtBlock block = getChunk((LuaParser.ChunkContext)chunk);
            if (block == null)
                continue;

            block.setParent(_file);
            _file.getBlocks().add(block);
        }
    }
}

private CtBlock getChunk(LuaParser.ChunkContext ctx) {
    for (var block : ctx.children) {
        if (block instanceof LuaParser.BlockContext) {
            return getBlock((LuaParser.BlockContext)block);
        }
    }
    return  null;
}

private CtBlock getBlock(LuaParser.BlockContext ctx) {
    CtBlock block = new CtBlockImpl();
    block.setLine(ctx.start.getLine());
    block.setColumn(ctx.start.getCharPositionInLine());
    block.setStatements(new ArrayList<>());

    for (var statement : ctx.children) {
        if (statement instanceof LuaParser.StatContext) {
            var statements = getStatement((LuaParser.StatContext)statement);
            for (var ctStatement : statements) {
                ctStatement.setParent(block);
                block.getStatements().add(ctStatement);
            }
        }
    }
    return block;
}

로그인 후 복사

우리는 위에서 아래로 트리를 통과하면서 길을 따라 트리를 만듭니다. 조만간 우리는 터미널 노드에 도달할 것입니다. 다음은 함수 매개변수입니다:

private List<CtParameter> parseParameters(
    LuaParser.NamelistContext ctx,
    CtElement parent
) {
    var parameters = new ArrayList<CtParameter>();
    for (var child : ctx.children) {
        if (Objects.equals(child.toString(), ","))
            continue;

        var parameter = new CtParameterImpl();
        parameter.setParameterName(child.toString());
        parameter.setParent(parent);
        parameters.add(parameter);
    }
    return parameters;
}

로그인 후 복사

또한 그다지 어려운 일도 아닌 것 같습니다. 단지 한 물건을 다른 물건으로 바꾸는 것뿐입니다. 코드 목록도 여기에 포함시켜 보겠습니다. 작동 방식이 명확하길 바랍니다.

솔직히 이 접근 방식은 장기적으로 큰 이익으로 이어지지 않습니다. 텍스트를 트리로 변환하는 대신 한 트리를 다른 트리로 변환합니다. 두 작업 모두 다소 지루합니다. 어떤 옵션이 있나요?

어휘분석기와 파서를 처음부터 작성할 수 있습니다. 좋습니다. 하지만 마감일과 기술에 제약이 있을 때는 그렇지 않습니다.
원하는 AST를 즉시 출력하도록 ANTLR을 구성할 수 있습니다. 사실이라고 하기에는 너무 좋은 것 같지만 여전히 ANTLR을 연구해야 하며 이는 상당한 시간 낭비이기도 합니다.
빠른 시장 출시 솔루션. 우리는 우리가 가지고 있는 것을 가지고 작업하고 결과 트리를 대상 트리로 변환해야 합니다. 좋지는 않지만 참을만 합니다.
우리는 전혀 전환하지 않습니다. 팀에 Java 분석기 개발자가 없었다면 그렇게 했을 것입니다.

AST에 대한 코드 분석 예시를 가져오지 않았다면 섹션이 분명히 불완전할 것입니다. 스포일러 아래에서 동일한 계승 계산을 위해 트리의 예쁜 프린트를 살펴볼 수 있습니다.

CtGlobal:
  CtFile:
    CtFunction: func
      Parameters:
        CtParameter: n
      CtBlock:
        CtIf:
          Condition:
            CtBinaryOperator: Equals
              Left:
                CtVariableRead: n
              Right:
                CtLiteral: 0
          Then block
            CtBlock:
              CtReturn:
                CtLiteral: 1    
          Else block
            CtBlock:
              CtReturn:
                CtBinaryOperator: Multiplication
                  Left:
                    CtVariableRead: n
                  Right:
                    CtInvocation: fact
                      Arguments:
                        CtBinaryOperator: Minus
                          Left:
                            CtVariableRead: n
                          Right:
                            CtLiteral: 1
    CtInvocation: print
      Arguments:
        CtLiteral: "enter a number:"
    CtAssignment:
      Left:
        CtVariableWrite: a
      Right:
        CtInvocation:
          CtFieldRead: 
            Target: io
            Field: read
          Arguments:
            CtParameter: "*number"
    CtInvocation: print
      Arguments:
        CtInvocation: fact
          Arguments:
            CtVariableRead: a

로그인 후 복사

용어에 대한 간단한 설명으로 마무리하겠습니다. 우리 엔터티를 트리 변환기라고 부르는 것이 더 좋지만 지금은 그냥 파서라고 부르겠습니다.

방문객

다음으로 분석을 수행하고 진단 규칙을 구축하는 데 사용하는 기능을 개발하겠습니다. 트리 순회입니다. 전반적으로 트리 반복을 구현하는 방법을 파악하기 쉽습니다. 그러나 트리 노드를 사용하여 몇 가지 유용한 작업을 수행해야 합니다. 방문자 패턴이 무대에 등장합니다.

멋진 구현이 있지만 다소 모호한 사용 시나리오가 있는 고전 GoF 책의 매혹적인 패턴을 기억하십니까? 그래서 실제 상황에서 어떻게 사용되는지 벤치마크 사례를 가지고 왔습니다. 기사를 간결하게 유지하기 위해 책에서는 구현 방법을 보여주지 않고 분석기에서 어떻게 구현하는지 보여드리겠습니다.

트리 순회부터 간단하게 시작해 보겠습니다. CtScanner 클래스를 정의하고 단일 항목과 해당 컬렉션에 대해 두 개의 scan 메서드를 추가합니다.

public <T extends CtElement> void scan(T element) {
    if (element != null) {
        element.accept(this);
    }
}

public <T extends CtElement> void scan(List<T> elements) {
    if (elements != null) {
        for (var element : elements) {
            scan(element);
        }
    }
}

로그인 후 복사

CtElement의 수락이 보이시나요? 우리의 경우 CtVisitable 인터페이스를 상속하는 모든 클래스는 void accept(CtAbstractVisitor Visitor) 메소드를 구현합니다. CtAbstractVisitor에 대해서는 잠시 후에 이야기하겠습니다. 이제 CtScanner가 이를 상속받았다는 것만 알면 충분합니다.

CtAssignment에서 accept는 다음과 같습니다.

@Override
public void accept(CtAbstractVisitor visitor){
    visitor.visitCtAssignment(this);
}

로그인 후 복사

네, 아주 쉽죠. 노드는 방문자에서 해당 메서드를 호출합니다. CtScanner에는 트리 노드의 각 클래스에 대한 메서드가 있어야 합니다.

@Override
public void visitCtIf(CtIf ctIf) {
    scan(ctIf.getCondition());
    scan((CtStatement) ctIf.getThenStatement());
    scan((CtStatement) ctIf.getElseStatement());
}

@Override
public <T> void visitCtLiteral(CtLiteral<T> literal) {
}
@Override
public void visitCtStatement(CtStatement statement) {
}

// ....

로그인 후 복사

이제 CtScanner에서 추출한 인터페이스인 CtAbstractVisitor로 돌아가 보겠습니다. 여기에는 트리 노드를 방문하는 방법이 포함되어 있지만 이러한 방법만 있고 스캔 방법은 없습니다. 방문자 메서드를 구현할 때 향후 오버로드를 위한 자리 표시자를 남겨두거나(터미널 노드인 경우) 재귀 하강을 수행하여 트리 노드를 계속해서 펼칩니다. 이것이 우리가 계속하기 위해 알아야 할 전부입니다.

Semantic parsing

Introduction

It'd seem that's all with the core. For example, now our analyzer can catch simple errors like the variable self-assignment. We call it the signature analysis. To provide more advanced analysis, we'd like to obtain more data about what's going on in the code. The parser has done its job, it's time to create new entities.

So far, we've completely followed the way of compilers regarding of the analyzer structure, but now our path diverges. If Lua had a compiler, it'd start analysis to ensure that the code is syntactically correct. Although our tools will continue to overlap in features.

For our needs, let's create SemanticResolver for two purposes:

define the variable scope;
evaluate the variable type based on the duck typing.

However, we'll call it from the parser along with its traversal. No decorator this time.

To keep the app structure seamless, we'll contain all the semantic data in the same AST nodes. First, let's define the necessary properties in the CtVariableAccess interface:

CtBlock getScope();
void setScope(CtBlock block);

TypeKind getTypeKind();
void setTypeKind(TypeKind type);

로그인 후 복사

Variable scope

Let's start with scope; this will be our key tool for defining the variable along with its name. First, we define the variable entity inside SemanticResolver. To keep it short, we'll show you only the interface, but the gist should be clear:

public static class Variable {
    public Variable(String identifier);
    public String getIdentifier();
    public CtBlock getScope();
    public void setScope(CtBlock block);
    public void setType(TypeKind type);
    public TypeKind getType();
    // Methods use only identifier
    @Override
    public boolean equals(Object o);
    @Override
    public int hashCode();
}

로그인 후 복사

Let's also further define the stack for variable scopes:

private final Stack<Pair<CtBlock, HashSet<Variable>>> stack = new Stack<>();
private final CtGlobal global;
public SemanticResolver(CtGlobal global) {
    pushStack(global);
    this.global = stack.peek();
}

public void pushStack(CtBlock block) {
    stack.push(Pair.of(block, new HashSet<>()));
}
public void popStack() {
    stack.pop();
}

로그인 후 복사

The stack entry consists of a tuple of scope and variables set registered there. The stack operation is mundane. Here's how it works in the parser:

private CtBlock getBlock(LuaParser.BlockContext ctx) {
    CtBlock block = new CtBlockImpl();
    resolver.pushStack(block);

    // ....

    resolver.popStack();
    return block;
}

로그인 후 복사

We have to register the variables somehow. If a variable is local, it's simple—let's take the current scope and pass the variable there:

public CtBlock registerLocal(Variable variable) {
    var scope = stack.pop();
    variable.setScope(scope.getLeft());
    scope.getRight().add(variable);
    stack.push(scope);
    return scope.getLeft();
}

로그인 후 복사

If the local keyword isn't used, the variable will be either global or be declared somewhere above. So, we first go through the stack and double-check if it exists:

public CtBlock registerUndefined(Variable variable) {
    var pair = lookupPair(variable);
    pair.getRight().add(variable);
    return pair.getLeft();
}

public Pair<CtBlock, HashSet<Variable>> lookupPair(Variable variable) {
    var buf = new Stack<Pair<CtBlock, HashSet<Variable>>>();
    Pair<CtBlock, HashSet<Variable>> res = null;
    while (!stack.isEmpty()) {
        var scope = stack.pop();
        buf.push(scope);
        if (scope.getRight().contains(variable)) {
            res = scope;
            break;
        }
    }

    while (!buf.isEmpty()) {
        stack.push(buf.pop());
    }

    if (res == null) {
        return global;
    }

    return res;
}

로그인 후 복사

We can set the scope to variables when we have the entry:

private CtVariableWrite getVariableWriteInternal(
        ParseTree ctx,
        boolean isLocal
) {
    var node = new CtVariableWriteImpl();
    node.setVariableName(ctx.getChild(0).toString());
    CtBlock scope;
    if (isLocal) {
        scope = resolver.registerLocal(
             new SemanticResolver.Variable(node.getVariableName()));
    } else {
        scope = resolver.registerUndefined(
            new SemanticResolver.Variable(node.getVariableName()));
    }
    node.setScope(scope);
    return node;
}

로그인 후 복사

And defining it for readings:

private CtExpression getExpression(LuaParser.ExpContext ctx) {
    // ....
    if (child instanceof LuaParser.PrefixexpContext) {
        // ....
        var scope = resolver.lookupScope(
            new SemanticResolver.Variable(variableRead.getVariableName())
        );
        variableRead.setScope(scope);

        return variableRead;
    }
    // ....
}

로그인 후 복사

We won't show the lookupScope code. It's a single-line wrapper over lookupPair, which we can see above. Here, we can end with the variable scope. We'll check the mechanism in the diagnostic rule on a separate section. Now we continue working on the semantic parsing. Now, let's move to the variable type determination.

Duck typing

How to determine the variable type? Indeed, we obtain them from the literals. Let's further define the type and the enumeration for them:

public interface CtLiteral<T> extends CtExpression, CtVisitable {
  // ....
  void setTypeKind(TypeKind kind);
  TypeKind getTypeKind();
}

로그인 후 복사

public enum TypeKind {
    Undefined,
    Number,
    String,
    Boolean,
    Nil
}

로그인 후 복사

Thus, the data type can be numeric, string, logical, and nil. However, it'll be undefined by default. The split of undefined and nil may seem far-fetched, but it's okay for the pilot.

We store the literal type only in the tree node, doing it from the parser:

private <T> CtLiteralImpl<T> createLiteral(
    // ....
    TypeKind type,
) {
    // ....

    literal.setTypeKind(type);
    return literal;
}

로그인 후 복사

However, the variable type will be both in the tree and in SemanticResolver. So, we can request it during further traversal and the AST building:

private ArrayList<CtAssignment> parseAssignments(LuaParser.StatContext ctx) {
    // ....
    for (int i = 0; i < variables.size(); i++) {
        var assignment = new CtAssignmentImpl();
        var variable = variables.get(i);

        // ....


        variable.setTypeKind(resolver.lookupType(variable.getVariableName()));
        resolver.setType(
            variable.getVariableName(),
            variable.getScope(),
            SemanticResolver.evaluateExpressionType(expression)
        );
    }
    return assignments;
}

로그인 후 복사

There is no error in the operator precedence. Let the stored variable have the type from its past assignment. It'll facilitate our work in the future. As for the methods used here, there's nothing incredible about the lookupType implementation—it's basically the same as lookupPair. There's nothing complex about setType:

public void setType(String variable, CtBlock scope, TypeKind type) {
    var opt = stack.stream()
             .filter(x -> Objects.equals(x.getLeft(), scope))
             .findFirst();
    if (opt.isPresent()) {
        var pair = opt.get();
        var newVar = new Variable(variable);
        var meta = pair.getRight()
                       .stream()
                       .filter(x -> x.equals(newVar))
                       .findFirst();
        meta.ifPresent(value -> value.setType(type));
    }
}

로그인 후 복사

However, evaluateExpressionType is trickier. Computing variable types in dynamic languages can be a bit of a hassle. Just look at the jokes about JavaScript and string concatenation. However, firstly, Lua has a separate operator '..', and secondly, we're trying to keep the process simple. So, we'll only determine whether all the operands are the same type. We'll use the familiar CtScanner.

public static TypeKind evaluateExpressionType(CtExpression expression) {
    Mutable<TypeKind> type = new MutableObject<>(null);
    var typeEvaluator = new CtScanner() {
        private boolean stop = false;
        @Override
        public void scan(CtElement el) {
            if (stop) { return; }

            if (el instanceof CtVariableRead || el instanceof CtLiteral<?>) {
                var newType = el instanceof CtVariableRead
                              ? ((CtVariableRead) el).getTypeKind()
                              : ((CtLiteral<?>) el).getTypeKind();

                if (newType.equals(TypeKind.Undefined)) {
                    type.setValue(TypeKind.Undefined);
                    stop = true;
                    return;
                } else if (type.getValue() == null) {
                    type.setValue(newType);
                } else if (!type.getValue().equals(newType)) {
                    type.setValue(TypeKind.Undefined);
                    stop = true;
                    return;
                }
            }
            super.scan(el);
        }
    };
    typeEvaluator.scan(expression);
    return type.getValue();
}

로그인 후 복사

In parseAssignments, we've set the assignment type to the (CtVariableWrite) variable but forgot about reading (CtVariableRead). Let's fix it:

private CtExpression getExpression(LuaParser.ExpContext ctx) {
    // ....
    if (child instanceof LuaParser.PrefixexpContext) {
        // ....
        variableRead.setTypeKind(
            resolver.lookupType(variableRead.getVariableName())
        );
        var scope = resolver.lookupScope(
            new SemanticResolver.Variable(variableRead.getVariableName()));
        variableRead.setScope(scope);

        return variableRead;
    }
    // ....
}

로그인 후 복사

We've completed the semantic analysis and almost ready to start searching for bugs.

Data-flow analysis

Inside structure

First, we make two quick stops. While the topic of the data-flow analysis deserves a series of articles, it'd be wrong to skip over it. Here, we won't dive deep into theory but just try to memorize the values set by literals.

First, let's fall into the sin of self-copying and define the variable entity for DataFlow again but in a simpler way. Again, we'll only show you the interface:

private static class Variable {
    private Variable(String identifier, CtBlock scope);
    // Methods use identifier and scope
    @Override
    public boolean equals(Object o);
    @Override
    public int hashCode();
}

로그인 후 복사

Here's the rest of the class content:

public class DataFlow {
    private static class Variable {
        // ....
    }
    Map<Variable, Object> variableCache = new HashMap<>();

    public void scanDataFlow(CtElement element) {
        if (element instanceof CtAssignment) {
            CtAssignment variableWrite = (CtAssignment) element;
            if (variableWrite.getAssignment() instanceof CtLiteral<?>) {
                var assigned = variableWrite.getAssigned();
                var variable = new Variable(
                    assigned.getVariableName(),
                    assigned.getScope()
                );
                variableCache.put(
                    variable, 
                    getValue(variableWrite.getAssignment())
                );
            }
        }
    }

    public Object getValue(CtExpression expression) {
        if (expression instanceof CtVariableRead) {
            CtVariableRead variableRead = (CtVariableRead) expression;
            var variable = new Variable(
                variableRead.getVariableName(),
                variableRead.getScope()
            );
            return variableCache.getOrDefault(variable, null);
        } else if (expression instanceof CtLiteral<?>) {
            return ((CtLiteral<?>) expression).getValue();
        }
        return null;
    }
}

로그인 후 복사

It's quite simple: in scanDataFlow, we associate the value with the variable, and in getValue, we extract that value for the set node. Everything is simple because we don't factor in branching, loops, or even expressions. Why don't we factor them in? Branching is the very topic that deserves its own series of articles. What about expressions? Well, we didn't make it in two days. Given what we've achieved in the last two days, we think this task is feasible. We just postpone it as a homework.

That's all. It's clear that such a solution is far from a real product, but we have laid some foundation. Then we can either try to enhance the code—and we'll introduce data-flow analysis via AST. Or we can redo everything properly and build the control-flow graph.

We've done the class implementation, but we haven't discussed how to use it. We've written the class, but what shall we do with it? Let's talk about it in the following section. Now we just say that DataFlow works right before calling the diagnostic rules. Those are called when traversing the finished AST. Thus, the rules will have access to the current variable values. This is called Environment; you can see it in your debugger.

Walker

Welcome to the last section regarding the core. We've already had the AST that is full of semantic data, as well as the data-flow analysis that is just waiting to be run. It's a good time to put it all together and set the stage for our diagnostic rules.

How is the analysis performed? It's simple: we get on the topmost tree node, and then we start recursive traversal. That is, we need something that will traverse the tree. We have CtScanner for that. Based on it, we define MunInvoker:

public class MunInvoker extends CtScanner {
    private final List<MunRule> rules = new ArrayList<>();
    private final Analyzer analyzer;

    public MunInvoker(Analyzer analyzer) {
        this.analyzer = analyzer;
        rules.add(new M6005(analyzer));
        rules.add(new M6020(analyzer));
        rules.add(new M7000(analyzer));
        rules.add(new M7001(analyzer));
    }

    @Override
    public <T extends CtElement> void scan(T element) {
        if (element != null) {
            analyzer.getDataFlow().scanDataFlow(element);
            rules.forEach(element::accept);
            super.scan(element);
        }
    }
}

로그인 후 복사

You can notice a few unknown things in the code:

the Analyzer class. It encapsulates the whole analysis process and contains shared resources that need to be accessed within the rules. In our case, this is the DataFlow instance. We'll get back to Analyzer later;
four confusing classes that are added to rules. We'll talk about them in the next section, so don't panic. A little spoiler for you :)

Otherwise, the class operation shouldn't raise any questions. Each time we enter any tree node, the analyzer rule is called, and the variable values are evaluated right before it. Next, the traversal continues in line with to the CtScanner algorithm.

Analysis

Preparing for writing diagnostic rules

Rule class

So, we have the analyzer core prototype—it's a good time to start analyzing something.

The base for our rules is ready—it's the CtAbstractVisitor class. The analysis goes as follows: the rule overloads a few visitors and analyzes the data contained in the AST nodes. Let's extend CtAbstractVisitor with the abstract MunRule class, which we use to create rules. In this class, we also define the addRule method that generates warnings.

Speaking of warnings: what data do they need? First is a warning message to demonstrate users what they may have been mistaken about. Second, the user needs to know where and what the analyzer warns. So, let's add data about the file where the analyzer has detected the troublesome code block and the location of this code fragment.

Here's what the MunRule class looks like:

public abstract class MunRule extends CtAbstractVisitor {
    private Analyzer analyzer;
    public void MunRule(Analyzer analyzer) {
        this.analyzer = analyzer;
    }
    protected Analyzer getAnalyzer() {
        return analyzer;
    }

    protected void addRule(String message, CtElement element) {
        var warning = new Warning();
        warning.message = message;

        WarningPosition pos = new WarningPosition(
            Analyzer.getFile(), 
            element.getLine(), 
            element.getColumn() + 1
        );
        warning.positions.add(pos);
        analyzer.addWarning(warning);
    }

    public DataFlow getDataFlow() {
        return analyzer.getDataFlow();
    }
}

로그인 후 복사

The WarningPosition and Warning classes are just data stores, so we won't list them. We'll talk about addWarning now.

Merging it together

The last thing to prepare is how we're going to view the diagnostic rules. To do this, we combine all our features together, using the already mentioned Analyzer class. Actually, here it is:

public class Analyzer {
    private DataFlow dataFlow = new DataFlow();

    public DataFlow getDataFlow() {
        return dataFlow;
    }

    public CtElement getAst(String pathToFile) throws IOException {
        InputStream inputStream = new FileInputStream(pathToFile);
        Lexer lexer = new LuaLexer(CharStreams.fromStream(inputStream));

        ParseTreeWalker walker = new ParseTreeWalker();
        var listener = new LuaAstParser();
        walker.walk(listener, new LuaParser(
            new CommonTokenStream(lexer)
        ).start_());
        return listener.getFile();
    }

    protected void addWarning(Warning warning) {
        Main.logger.info(
            "WARNING: " + warning.code + " "
            + warning.message + " ("
            + warning.positions.get(0).line + ", "
            + warning.positions.get(0).column + ")");
      }      

    public MunInvoker getMunInvoker() {
        return new MunInvoker(this);
    }

    public void analyze(String pathToFile) {
        try {
            var top = getAst(pathToFile);
            var invoker = getMunInvoker();
            invoker.scan(top);
        }
        catch (IOException ex) {
            Main.logger.error("IO error: " + ex.getMessage());
        }
    }
}

로그인 후 복사

To explain how it works, we'll give you an example of the whole analysis process:

In the getAst method, we build our AST using the scheme lexer —> parser —> tree translator;
Then we call MunIvoker, which traverses the tree and calls our diagnostic rules along with the data-flow analysis;
If necessary, the rules call the Analyzer class to get the DataFlow instance;
They call addWarning when the analyzer spots suspicious code fragment. That one just outputs the message to the log.

We've got the prep work—it's time to start writing diagnostic rules.

Writing diagnostic rules

Assigning a variable to itself

We've decided to start writing the rules with a simple one: PVS-Studio has the Java diagnostic rule, V6005, where the analyzer check if a variable is assigned to itself. It can simply be copied and slightly adapted to our tree. Since our analyzer is called Mun, we start the numbers of our diagnostic rules with M. Let's create the M6005 class, extending MunRule, and override the visitCtAssignment method in it. The following check will be located in the method:

public class M6005 extends MunRule {
    private void addRule(CtVariableAccess variable) {
        addRule("The variable is assigned to itself.", variable);
    }
    @Override
    public void visitCtAssignment(CtAssignment assignment) {
        if (RulesUtils.equals(assignment.getAssigned(), 
            assignment.getAssignment())) {
            addRule(assignment.getAssigned());
        }
    }
}

로그인 후 복사

The used RulesUtils.equals method is a wrapper and overload for another equals method that checks the name and scope:

public static boolean equals(CtVariableAccess left, CtVariableAccess right) {
    return left.getVariableName().equals(right.getVariableName())
           && left.getScope().equals(right.getScope());
}

로그인 후 복사

We need to check the scope, because the next code fragment isn't an assignment of the variable to itself:

local a = 5;
begin
    local a = a;
end

로그인 후 복사

Now, we can test the diagnostic rule on some simple code and see if it works. The following example will issue the warning on the marked line: "M6005 The variable is assigned to itself":

local a = 5;
local b = 3;

if (b > a) then
    a = a;      <=
end

로그인 후 복사

Zero division

Well, we've warmed up, so let's move on. The analyzer has already had the primitive data-flow analysis (DataFlow) that we can and should use. Again, let's look at one of our existing diagnostic rules, V6020, where the analyzer checks for the zero division. We try to adapt it for our analyzer. The divisor can be either the variable as zero or a literal as zero. We need to access the cache of variables to check their value.

Here's the simple implementation of such a diagnostic rule:

public class M6020 extends MunRule {
  private void addWarning(CtElement expression, String opText) {
    addRule(String.format(
            "%s by zero. Denominator '%s' == 0.",
            opText, expression instanceof CtLiteral
                ? ((CtLiteral) expression).getValue()
                : ((CtVariableRead) expression).getVariableName()
        ),
        expression
    );
  }

  @Override
  public void visitCtBinaryOperator(CtBinaryOperator operator) {
    BinaryOperatorKind opKind = operator.getKind();
    if (opKind != BinaryOperatorKind.DIV && opKind != BinaryOperatorKind.MOD) {
      return;
    }

    apply(operator.getRightHandOperand(), opKind == BinaryOperatorKind.MOD);
  }


  private void apply(CtExpression expr, boolean isMod) {
    Object variable = getDataFlow().getValue(expr);
    if (variable instanceof Integer) {
      if ((Integer) variable == 0) {
        String opText = isMod ? "Mod" : "Divide";
        addWarning(expr, opText);
      }
    }
  }
}

로그인 후 복사

We can see that the diagnostic rule works on simple examples and issues the following warning: "M6020 Divide by zero. Denominator 'b' == 0" on these lines:

local a = 5;
local b = 0;
local c = a / b;   <=
local d = a / 0;   <=

로그인 후 복사

If you have made the expression evaluator as your homework, you may try the diagnostic rule on this code:

local b = 7;
local b = b - 7;
local c = a / b;

로그인 후 복사

Overwritten types

Since we're writing the Lua analyzer, we need to write diagnostic rules for this language. Let's start simple.

Lua is a dynamic scripting language. Let's use this feature and write the diagnostic rule that allows us to catch overwritten types.

We'll also need to pick a new number for the diagnostic rule. Earlier we just copied them from the Java analyzer. Now, it seems like it's time to start a new thousand—the seventh. Who knows what diagnostic rule it'll be in PVS-Studio, but at the time of writing this article, it will be Lua.

Knowing about types, we facilitate our case: we need to check that the left and right assignments are of different types. Note that we ignore Undefined for the left part and Nil for the right part. The code looks like this:

public class M7000 extends MunRule {
    @Override
    public void visitCtAssignment(CtAssignment assignment) {
        var assigned = assignment.getAssigned();
        var exprType = SemanticResolver.evaluateExpressionType(
            assignment.getAssignment());
        if (assigned.getTypeKind().equals(TypeKind.Undefined)
            || exprType.equals(TypeKind.Nil)
        ) {
            return;
        }

        if (!assigned.getTypeKind().equals(exprType)) {
            addRule(
                String.format(
                    "Type of the variable %s is overridden from %s to %s.",
                    assigned.getTypeKind().toString(),
                    exprType.toString()
                assigned,
            );
        }
    }
}

로그인 후 복사

It's time to check the diagnostic rule in real cases. In the following example, our analyzer issues the warning only on the last line:

local a = "string";

if (true) then
  local a = 5;
end

a = 5;     <=

로그인 후 복사

The analyzer warning: "M7000 Type of the variable a is overridden from Integer to String".

Lost local

Let's finish smoothly. The Lua plugin for VS Code has a diagnostic rule that detects global lowercase variables. This check can help detect the forgotten local identifiers. Let's implement the same diagnostic rule in our analyzer.

Here, just as before, we'll need to use the variable scope data that is obtained via the semantic analysis. We just find where the variable was declared. That's also where the variable is assigned a value. Then check its scope and name. If the variable is global and starts with a lowercase, the analyzer will warn. Easy-peasy.

Let's create a new class and override the visitCtAssignment method in it again. That way, we can look for the problematic global variables:

public class M7001 extends MunRule {
    @Override
    public void visitCtAssignment(CtAssignment assignment) {
        var variable = assignment.getAssigned();
        var firstLetter = variable.getVariableName().substring(0, 1);

        if (variable.getScope() instanceof CtGlobal && 
            !firstLetter.equals(firstLetter.toUpperCase())) {
            addRule("Global variable in lowercase initial.", variable);
        }
    }
}

로그인 후 복사

Let's check the diagnostic rule work. It issues the warning: "M7001 Global variable in lowercase initial." It's issued on the second line in the code snippet below:

function sum_numbers(b, c)
    a = b + c;  <=
    return a;
end

local a = sum_numbers(10, 5);

로그인 후 복사

Alright, we've written the diagnostic rules, we've done with the analyzer (and it even works). Breathe out. Now we can enjoy the fruits of our labors.

Viewing warnings

Above we've already shown the code that allows us to view warnings in a console or file. Here is how the process of working with the analyzer looks like in the console. We run the analyzer using the command:

java -jar mun-analyzer.jar -input "C:\munproject\test.lua"

And we get:

How to develop code analyzer in hours

However, first, the analyzer is a tool, and it should be user-friendly. Instead of working with the console, it'd be much more convenient to work with a plugin.

다시 빌릴 시간이 왔습니다. PVS-Studio에는 이미 Visual Studio Code 및 IntelliJ IDEA와 같은 다양한 IDE용 플러그인이 있습니다. 경고를 보고 탐색할 수 있습니다. 분석기 보고서에는 표준화된 형식이 있으므로 Java 분석기에서 JSON 보고서 생성 알고리즘을 간단히 빌릴 수 있습니다. 알고리즘은 광범위하고 지루하므로 보여주지 않겠습니다. 여전히 명령줄에서 실행해야 하지만 인수* -output "D:git/MunProject/report.json"*을 사용하여 실행해야 합니다.

그런 다음 IntelliJ IDEA 또는 VS Code에서 보고서를 열고 Lua 분석기 경고를 확인할 수 있습니다.

How to develop code analyzer in hours

달콤해요! 이제 유용성을 희생하지 않고도 분석기를 원래 목적에 맞게 사용할 수 있습니다.

애드 아스트라

그럼 본격적인 분석기를 작성해 볼까요? 어, 정확하진 않아요. 이 긴 여정의 끝에는 모든 단계를 거치는 가장 실제적인 조종사가 있습니다. 그러나 성장 범위는 엄청납니다.

핵심 개선 사항:
- 데이터 흐름 분석을 강화합니다.
- 제어 흐름을 고려하세요.
- 프로시저 간 및 모듈 간 분석을 추가합니다. 여기에서 C++로 어떻게 작업했는지 읽어보실 수 있습니다.
- 데이터 흐름 분석 및 덕 타이핑을 향상하는 데 도움이 되는 주석 메커니즘을 추가합니다.
- 더 많은 의미론적 데이터 제공
- 기존 메커니즘을 미세 조정합니다.
- 더 나은 파서도 잊지 마세요.
진단 규칙 개선 사항:
- 기존 기능을 강화하고
- 새 글을 더 많이 작성하세요.
- 몇 줄이 넘는 더 복잡한 진단 규칙을 작성하세요.
분석기 유용성 향상:
- 적절한 플러그인 지원을 생성하세요.
- CI/CD에 통합합니다.
개발 및 수정 시 진단 규칙 성능을 확인하기 위한 단위 테스트 및 회귀 테스트

그리고 훨씬 더 많은 것입니다. 즉, 파일럿에서 본격적인 도구로 가는 길은 매우 까다롭습니다. 그래서 PVS-Studio는 새로운 방향이 아닌 기존 방향인 C#, C, C++, Java에 중점을 두고 있습니다. 이러한 언어로 글을 쓰신다면 당사의 분석기를 사용해 보시기 바랍니다.