In your personal or professional projects, you sometimes have to process large volumes of data. Batch processing is an efficient way of handling them: the data is collected, processed, and then the batch results are produced. Batch processing applies to many use cases; a common one is transforming a large set of CSV or JSON files into a structured format ready for further processing.
In this tutorial, we will see how to set up this architecture with Spring Boot, a framework that facilitates the development of Spring-based applications.
Spring Batch is an open-source framework for batch processing. It is a lightweight, comprehensive solution designed to enable the development of robust batch applications, which are often found in modern enterprise systems. Its development is the result of a collaboration between SpringSource and Accenture.
It makes it possible to overcome problems that recur in batch development: logging and tracing, transaction management, job processing statistics, job restart, skip, and resource management.
N.B.: In IT, a batch is a standalone program that carries out a set of processing operations on a volume of data.
To build a batch, we mainly rely on the following components:
JobLauncher: the component responsible for launching the batch program. It can be configured to trigger itself on a schedule or to be triggered by an external event (manual launch). In the Spring Batch workflow, the JobLauncher is responsible for executing a Job.
Job: the component representing the batch task to which the business need addressed by the program is delegated. It is responsible for launching one or more Steps in sequence.
Step: the component that encapsulates the heart of the business need. It defines three subcomponents, structured as follows:
ItemReader: the component responsible for reading the input data to be processed, which can come from various sources (databases, flat files (CSV, XML, XLS, etc.), queues);
ItemProcessor: the component responsible for transforming the data read. All the business rules are implemented within it;
ItemWriter: the component that saves the data transformed by the processor to one or more desired targets (databases, flat files (CSV, XML, XLS, etc.), cloud storage).
JobRepository: the component responsible for recording statistics on the JobLauncher, the Job, and the Step(s) at each execution. It offers two techniques for storing those statistics: a database or a Map. When the statistics are stored in a database, and therefore durably persisted, the batch can be monitored continuously over time and failures can be analyzed after the fact. With a Map, by contrast, the statistics are lost at the end of each execution instance. In either case, one of the two must be configured.
For more information, I advise you to consult the Spring website.
After this brief overview of the Spring Batch architecture, let's now set up a Spring Batch job that reads data from a CSV file and inserts it into a database. "Let's get into coding."
The easiest way to generate a Spring Boot project is to use Spring Initializr, following the steps below:
Once the project is generated, unzip it and import it into your IDE.
Technologies used:
All project dependencies are declared in the pom.xml file. POM is the acronym for Project Object Model: its XML representation is translated by Maven into a data structure that represents the project model.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.3.3.RELEASE</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>
    <groupId>com.pathus</groupId>
    <artifactId>SpringBatchExample</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>SpringBatchExample</name>
    <description>Demo of spring batch project</description>

    <properties>
        <java.version>1.8</java.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-batch</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-jpa</artifactId>
        </dependency>
        <dependency>
            <groupId>com.h2database</groupId>
            <artifactId>h2</artifactId>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
            <exclusions>
                <exclusion>
                    <groupId>org.junit.vintage</groupId>
                    <artifactId>junit-vintage-engine</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.springframework.batch</groupId>
            <artifactId>spring-batch-test</artifactId>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
        </plugins>
    </build>
</project>
The structure of the project is as follows:
To enable batch processing, we annotate the configuration class with @EnableBatchProcessing. We must then create a reader to read our CSV file, a processor to transform the input data before writing, and a writer to write to the database.
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.LineMapper;
import org.springframework.batch.item.file.mapping.DefaultLineMapper;
import org.springframework.batch.item.file.transform.DelimitedLineTokenizer;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.ClassPathResource;
import org.springframework.scheduling.annotation.EnableScheduling;

import com.pathus90.springbatchexample.batch.StudentProcessor;
import com.pathus90.springbatchexample.batch.StudentWriter;
import com.pathus90.springbatchexample.model.Student;
import com.pathus90.springbatchexample.model.StudentFieldSetMapper;

@Configuration
@EnableBatchProcessing
@EnableScheduling
public class BatchConfig {

    private static final String FILE_NAME = "results.csv";
    private static final String JOB_NAME = "listStudentsJob";
    private static final String STEP_NAME = "processingStep";
    private static final String READER_NAME = "studentItemReader";

    @Value("${header.names}")
    private String names;

    @Value("${line.delimiter}")
    private String delimiter;

    @Autowired
    private JobBuilderFactory jobBuilderFactory;

    @Autowired
    private StepBuilderFactory stepBuilderFactory;

    @Bean
    public Step studentStep() {
        return stepBuilderFactory.get(STEP_NAME)
                .<Student, Student>chunk(5)
                .reader(studentItemReader())
                .processor(studentItemProcessor())
                .writer(studentItemWriter())
                .build();
    }

    @Bean
    public Job listStudentsJob(Step step1) {
        return jobBuilderFactory.get(JOB_NAME)
                .start(step1)
                .build();
    }

    @Bean
    public ItemReader<Student> studentItemReader() {
        FlatFileItemReader<Student> reader = new FlatFileItemReader<>();
        reader.setResource(new ClassPathResource(FILE_NAME));
        reader.setName(READER_NAME);
        reader.setLinesToSkip(1);
        reader.setLineMapper(lineMapper());
        return reader;
    }

    @Bean
    public LineMapper<Student> lineMapper() {
        final DefaultLineMapper<Student> defaultLineMapper = new DefaultLineMapper<>();
        final DelimitedLineTokenizer lineTokenizer = new DelimitedLineTokenizer();
        lineTokenizer.setDelimiter(delimiter);
        lineTokenizer.setStrict(false);
        lineTokenizer.setNames(names.split(delimiter));
        final StudentFieldSetMapper fieldSetMapper = new StudentFieldSetMapper();
        defaultLineMapper.setLineTokenizer(lineTokenizer);
        defaultLineMapper.setFieldSetMapper(fieldSetMapper);
        return defaultLineMapper;
    }

    @Bean
    public ItemProcessor<Student, Student> studentItemProcessor() {
        return new StudentProcessor();
    }

    @Bean
    public ItemWriter<Student> studentItemWriter() {
        return new StudentWriter();
    }
}
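The ${header.names} and ${line.delimiter} placeholders injected into BatchConfig are resolved from application.properties, which the article does not show. A plausible configuration for a comma-delimited file, with column names assumed from the StudentFieldSetMapper field order, would be:

```properties
# Assumed header names, in the order read by StudentFieldSetMapper
header.names=rank,firstName,lastName,center,pv,origin,mention
# Assumed delimiter for the CSV file
line.delimiter=,
```

Note that lineTokenizer.setNames(names.split(delimiter)) splits header.names with the same delimiter, so the two property values must be consistent with each other.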
The first method defines the step and the second defines the job. Jobs are built from steps, where each step can involve a reader, a processor, and a writer. In the step definition, we define how many records are processed per chunk; in our case, up to 5 records are written at a time. Then we configure the reader, the processor, and the writer using the beans defined previously. A job can chain several steps in a precise order; here, the step studentStep is executed by the job listStudentsJob.
@Bean
public Step studentStep() {
    return stepBuilderFactory.get(STEP_NAME)
            .<Student, Student>chunk(5)
            .reader(studentItemReader())
            .processor(studentItemProcessor())
            .writer(studentItemWriter())
            .build();
}

@Bean
public Job listStudentsJob(Step step1) {
    return jobBuilderFactory.get(JOB_NAME)
            .start(step1)
            .build();
}
In our batch configuration, the Reader reads from a data source; it is called repeatedly within a step and returns objects of the type for which it is defined (Student in our case).
@Bean
public ItemReader<Student> studentItemReader() {
    FlatFileItemReader<Student> reader = new FlatFileItemReader<>();
    reader.setResource(new ClassPathResource(FILE_NAME));
    reader.setName(READER_NAME);
    reader.setLinesToSkip(1);
    reader.setLineMapper(lineMapper());
    return reader;
}
The FlatFileItemReader class uses a DefaultLineMapper, which in turn uses a DelimitedLineTokenizer. The role of the DelimitedLineTokenizer is to split each line into a FieldSet object; the names property reflects the format of the file header and allows the fields of each line to be identified. The FieldSet is then turned into a business object by the mapping class set as the fieldSetMapper property, here StudentFieldSetMapper.
import org.springframework.batch.item.file.mapping.FieldSetMapper;
import org.springframework.batch.item.file.transform.FieldSet;

public class StudentFieldSetMapper implements FieldSetMapper<Student> {

    @Override
    public Student mapFieldSet(FieldSet fieldSet) {
        return Student.builder()
                .rank(fieldSet.readString(0))
                .firstName(fieldSet.readString(1))
                .lastName(fieldSet.readString(2))
                .center(fieldSet.readString(3))
                .pv(fieldSet.readString(4))
                .origin(fieldSet.readString(5))
                .mention(fieldSet.readString(6))
                .build();
    }
}
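The Student model itself is never shown in the article. Since the pom.xml pulls in Lombok, the real class is most likely annotated with @Builder (and @Entity for JPA); the following plain-Java sketch is only a hypothetical equivalent that illustrates the builder-style API used by the mapper and the accessors used by the processor:

```java
// Hypothetical sketch of the Student model implied by StudentFieldSetMapper.
// The actual project likely uses Lombok (@Data, @Builder) and JPA (@Entity);
// this hand-written builder mimics the same API for illustration only.
public class Student {

    private String rank;
    private String firstName;
    private String lastName;
    private String center;
    private String pv;
    private String origin;
    private String mention;

    public static Builder builder() {
        return new Builder();
    }

    public static class Builder {
        private final Student student = new Student();

        public Builder rank(String rank) { student.rank = rank; return this; }
        public Builder firstName(String firstName) { student.firstName = firstName; return this; }
        public Builder lastName(String lastName) { student.lastName = lastName; return this; }
        public Builder center(String center) { student.center = center; return this; }
        public Builder pv(String pv) { student.pv = pv; return this; }
        public Builder origin(String origin) { student.origin = origin; return this; }
        public Builder mention(String mention) { student.mention = mention; return this; }

        public Student build() { return student; }
    }

    // Getters and setters used by the reader, processor and writer.
    public String getRank() { return rank; }
    public String getFirstName() { return firstName; }
    public void setFirstName(String firstName) { this.firstName = firstName; }
    public String getLastName() { return lastName; }
    public void setLastName(String lastName) { this.lastName = lastName; }
    public String getCenter() { return center; }
    public void setCenter(String center) { this.center = center; }
    public String getPv() { return pv; }
    public String getOrigin() { return origin; }
    public void setOrigin(String origin) { this.origin = origin; }
    public String getMention() { return mention; }
    public void setMention(String mention) { this.mention = mention; }
}
```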
The LineMapper interface maps lines (strings) to objects; it is typically used to map lines read from a file.
@Bean
public LineMapper<Student> lineMapper() {
    final DefaultLineMapper<Student> defaultLineMapper = new DefaultLineMapper<>();
    final DelimitedLineTokenizer lineTokenizer = new DelimitedLineTokenizer();
    lineTokenizer.setDelimiter(delimiter);
    lineTokenizer.setStrict(false);
    lineTokenizer.setNames(names.split(delimiter));
    final StudentFieldSetMapper fieldSetMapper = new StudentFieldSetMapper();
    defaultLineMapper.setLineTokenizer(lineTokenizer);
    defaultLineMapper.setFieldSetMapper(fieldSetMapper);
    return defaultLineMapper;
}
Unlike the Reader, Processor implementations serve functional needs. A processor is not mandatory, and we can do without one if no transformation is required. In our example, we wrote a simple processor that merely converts a few attributes of our Student object to uppercase; you can go beyond this example with more concrete functional cases.
The Writer writes the data coming from the processor (or read directly by the Reader). In our case, it receives the transformed objects from the processor; each object is then persisted in our database and the transaction is committed.
Now that the batch configuration is complete, let's check that everything described above works.
To run the application, look for the class annotated with @SpringBootApplication: it is the entry point of our application.
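That main class is not reproduced in the article; it typically looks like the following sketch (the package and class names here are assumptions based on the pom.xml coordinates):

```java
package com.pathus90.springbatchexample;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

// Hypothetical entry point: @SpringBootApplication enables component
// scanning and auto-configuration, so BatchConfig is picked up at startup.
@SpringBootApplication
public class SpringBatchExampleApplication {

    public static void main(String[] args) {
        SpringApplication.run(SpringBatchExampleApplication.class, args);
    }
}
```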
Launching the main class above will start our job. Below is the StudentProcessor described earlier:
import org.springframework.batch.item.ItemProcessor;

import com.pathus90.springbatchexample.model.Student;

public class StudentProcessor implements ItemProcessor<Student, Student> {

    @Override
    public Student process(Student student) {
        student.setFirstName(student.getFirstName().toUpperCase());
        student.setLastName(student.getLastName().toUpperCase());
        student.setCenter(student.getCenter().toUpperCase());
        student.setOrigin(student.getOrigin().toUpperCase());
        student.setMention(student.getMention().toUpperCase());
        return student;
    }
}
A scheduler has been set up so that the batch is triggered automatically. In this example, once launched, the batch runs every 8 seconds. You can play with this by changing the fixedDelay value (in milliseconds).
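The scheduled launcher itself is not listed in the article; a minimal sketch of what it might look like (the class, field, and method names here are assumptions) is:

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// Hypothetical launcher: @EnableScheduling on BatchConfig makes Spring
// honor the @Scheduled annotation below.
@Component
public class BatchLauncher {

    @Autowired
    private JobLauncher jobLauncher;

    @Autowired
    private Job listStudentsJob;

    // Runs 8 seconds after the previous execution finishes.
    @Scheduled(fixedDelay = 8000)
    public void launchJob() throws Exception {
        // A unique parameter (the current time) is needed because Spring
        // Batch will not re-run a completed JobInstance with identical
        // parameters.
        JobParameters parameters = new JobParametersBuilder()
                .addLong("time", System.currentTimeMillis())
                .toJobParameters();
        jobLauncher.run(listStudentsJob, parameters);
    }
}
```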
import java.util.List;

import org.springframework.batch.item.ItemWriter;
import org.springframework.beans.factory.annotation.Autowired;

import com.pathus90.springbatchexample.model.Student;
import com.pathus90.springbatchexample.service.IStudentService;

import lombok.extern.slf4j.Slf4j;

@Slf4j
public class StudentWriter implements ItemWriter<Student> {

    @Autowired
    private IStudentService studentService;

    @Override
    public void write(List<? extends Student> students) {
        students.stream().forEach(student -> {
            log.info("Saving object {} to the database", student);
            studentService.insertStudent(student);
        });
    }
}
Besides running the main class above to start the batch, you can also run the command mvn spring-boot:run from a command prompt.
You can also launch the application from the JAR archive file; in that case:
Go to the parent folder of the project in a command prompt and run the command mvn clean package, which packages the project.
In the target folder, a jar file will be created.
To run the application, use the command java -jar target/generated_file_name-0.0.1-SNAPSHOT.jar
Also make sure the H2 console has started when launching our Spring Batch application: the database is generated automatically, along with the Student table.
We can clearly see that our file has been integrated into our database.
N.B.: If we want to start the batch manually, without going through the scheduler triggered by our settings, I have also exposed an API with a controller that calls the Spring Batch job.
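The controller is not reproduced in the article; a sketch consistent with the /load URL used below could look like this (the class and method names are assumptions, and serving HTTP would require adding spring-boot-starter-web to the pom):

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical controller exposing the job as an HTTP endpoint.
@RestController
public class JobController {

    @Autowired
    private JobLauncher jobLauncher;

    @Autowired
    private Job listStudentsJob;

    @GetMapping("/load")
    public String load() throws Exception {
        // Unique parameters allow the job to be launched repeatedly.
        JobParameters parameters = new JobParametersBuilder()
                .addLong("time", System.currentTimeMillis())
                .toJobParameters();
        JobExecution execution = jobLauncher.run(listStudentsJob, parameters);
        return execution.getStatus().toString();
    }
}
```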
Just open the URL http://localhost:8080/load and the batch will launch.
We have reached the end of this first tutorial on batch programming with the Spring framework. Leave comments or questions if you have any!
Happy learning, everyone; I hope this first tutorial will be beneficial to you.
You will find the source code available here