Avro is a row-based data serialization system developed within the Apache Hadoop project. An Avro file stores serialized data in a compact binary format, with the schema written in JSON. When Avro files are used with Apache MapReduce, they contain sync markers so that huge datasets can be split into subsets for distributed processing. Avro also defines a container file format for persistent data that can be read and written easily, with no extra configuration needed.
Avro is a data serialization system that supplies rich data structures and a compact, fast, binary data format. It also provides a container file for storing persistent data, and it supports remote procedure calls (RPC). Because it integrates simply with many languages, code generation is not required to read or write data files. Code generation is an optional optimization that is only worth implementing for statically typed languages.
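To make "compact binary format" concrete, here is a minimal Python sketch of how the Avro specification encodes int and long values (variable-length zig-zag coding) and strings (a length followed by UTF-8 bytes). The function names are illustrative only and are not part of any Avro library.

```python
def encode_long(n: int) -> bytes:
    """Encode a signed 64-bit integer as an Avro zig-zag varint."""
    if not -(1 << 63) <= n < (1 << 63):
        raise ValueError("value does not fit in a 64-bit long")
    # Zig-zag maps small magnitudes to small unsigned numbers:
    # 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    # Emit 7 bits per byte, low bits first; a set high bit means "more follows".
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(text: str) -> bytes:
    """Encode a string as its byte length (as a long) plus the UTF-8 bytes."""
    data = text.encode("utf-8")
    return encode_long(len(data)) + data
```

Small numbers take a single byte, and no field names or type tags are written alongside the values, which is a large part of why Avro data stays compact.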
Normally an Avro file has two segments: the first is a schema, which may be missing in some files, and the second is binary data. For example, if we open an Avro file in a text editor, we can see both segments: the first begins with the characters "Obj" and contains the schema as readable JSON, and the second holds the record data in binary form. We also need to confirm that the file type is one that Boomi can read and write.
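The two-segment layout can be inspected directly. The following is a rough Python sketch, based on the object container file layout in the Avro specification, that checks the opening "Obj" marker and extracts the JSON schema from the header metadata. The helper names are hypothetical, and the sketch stops at the header; it does not decode the data blocks.

```python
import io
import json

MAGIC = b"Obj\x01"  # four bytes that open an Avro object container file


def read_long(stream) -> int:
    """Decode one zig-zag varint (Avro int/long) from a binary stream."""
    shift = 0
    acc = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of data")
        acc |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1)


def read_bytes(stream) -> bytes:
    """Read a length-prefixed byte sequence (Avro bytes/string)."""
    return stream.read(read_long(stream))


def read_header(stream) -> dict:
    """Return the metadata map and 16-byte sync marker from the header."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count ends the metadata map
            break
        if count < 0:  # negative count: a block byte size follows
            read_long(stream)
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return {"meta": meta, "sync": stream.read(16)}


def read_schema(stream) -> dict:
    """Parse the JSON schema stored under the 'avro.schema' metadata key."""
    return json.loads(read_header(stream)["meta"]["avro.schema"])


def read_schema_from_bytes(data: bytes) -> dict:
    """Convenience wrapper for in-memory data."""
    return read_schema(io.BytesIO(data))
```

To try it on a real file, open the file in binary mode and pass the file object to read_schema.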
Let us look at how Avro files are configured; the behavior of Avro data files can be adjusted through several configuration parameters.
When we are using Hadoop,
To configure compression, we have to set the following properties:
spark.conf.set("spark.sql.avro.compression.codec", "deflate")
spark.conf.set("spark.sql.avro.deflate.level", "4")
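The deflate level trades CPU time against output size. As a rough illustration that is independent of Spark, the Python sketch below compresses the same sample bytes at several levels with the standard zlib module. It uses raw DEFLATE (RFC 1951, no zlib header or checksum), which is the form the Avro specification describes for its deflate codec. The sample data and function names are made up for the example.

```python
import zlib


def raw_deflate(data: bytes, level: int) -> bytes:
    """Compress with raw DEFLATE; negative wbits omits the zlib header."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
    return compressor.compress(data) + compressor.flush()


def raw_inflate(data: bytes) -> bytes:
    """Decompress raw DEFLATE data."""
    return zlib.decompress(data, -15)


def compare_levels(data: bytes, levels=(1, 4, 9)) -> dict:
    """Return the compressed size in bytes for each requested level."""
    return {level: len(raw_deflate(data, level)) for level in levels}


# Hypothetical, highly repetitive sample standing in for a block of records.
SAMPLE = b"title,exam_date,teacher\n" * 1000

if __name__ == "__main__":
    for level, size in compare_levels(SAMPLE).items():
        print(f"level {level}: {size} bytes (from {len(SAMPLE)})")
```

Higher levels generally spend more time searching for matches; whether that pays off depends on the data, so it is worth measuring on a representative sample before settling on a level.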
Avro schemas are built from two kinds of types: primitive types and complex types.
Primitive types: these include null, boolean, int, long, float, double, bytes, and string.
Schema: {"type": "null"}
Complex types: Avro supports six complex types.
Array: { "type": "array", "items": "long" }
Map: { "type": "map", "values": "long" }
Record: { "type": "record", "name": "---", "doc": "---", "fields": [ {"name": "--", "type": "int"}, --- ] }
Enum: { "type": "enum", "name": "---", "doc": "---", "symbols": ["--", "--"] }
Fixed: { "type": "fixed", "name": "---", "size": <size in bytes> }
Union: [ "null", "string", -- ]
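To show how these types combine, here is a small Python sketch with a hypothetical record schema that uses each complex type once, plus a simplified checker that tests whether a Python value fits a schema. The Student schema and the conforms function are invented for illustration; this is not Avro's own validation logic, and it does not handle references to previously defined named types.

```python
import json

# Hypothetical schema: every name in it is made up for this example.
SCHEMA = json.loads("""
{
  "type": "record",
  "name": "Student",
  "doc": "Illustrative record only",
  "fields": [
    {"name": "name",   "type": "string"},
    {"name": "age",    "type": "int"},
    {"name": "scores", "type": {"type": "array", "items": "long"}},
    {"name": "labels", "type": {"type": "map", "values": "long"}},
    {"name": "level",  "type": {"type": "enum", "name": "Level",
                                "symbols": ["JUNIOR", "SENIOR"]}},
    {"name": "digest", "type": {"type": "fixed", "name": "Digest", "size": 4}},
    {"name": "email",  "type": ["null", "string"]}
  ]
}
""")


def _is_int(value, bits: int) -> bool:
    """True for a non-boolean integer that fits in the given signed width."""
    return (isinstance(value, int) and not isinstance(value, bool)
            and -(1 << (bits - 1)) <= value < (1 << (bits - 1)))


PRIMITIVES = {
    "null": lambda v: v is None,
    "boolean": lambda v: isinstance(v, bool),
    "int": lambda v: _is_int(v, 32),
    "long": lambda v: _is_int(v, 64),
    "float": lambda v: isinstance(v, float),
    "double": lambda v: isinstance(v, float),
    "bytes": lambda v: isinstance(v, bytes),
    "string": lambda v: isinstance(v, str),
}


def conforms(schema, value) -> bool:
    """Simplified structural check of a value against a schema."""
    if isinstance(schema, str):       # primitive type name
        return PRIMITIVES[schema](value)
    if isinstance(schema, list):      # union: any branch may match
        return any(conforms(branch, value) for branch in schema)
    kind = schema["type"]
    if kind == "array":
        return isinstance(value, list) and all(
            conforms(schema["items"], item) for item in value)
    if kind == "map":
        return isinstance(value, dict) and all(
            isinstance(k, str) and conforms(schema["values"], v)
            for k, v in value.items())
    if kind == "record":
        return isinstance(value, dict) and all(
            f["name"] in value and conforms(f["type"], value[f["name"]])
            for f in schema["fields"])
    if kind == "enum":
        return isinstance(value, str) and value in schema["symbols"]
    if kind == "fixed":
        return isinstance(value, bytes) and len(value) == schema["size"]
    return conforms(kind, value)      # e.g. {"type": "null"}
```

The union on the email field is the usual way to express an optional value: the field accepts either null or a string.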
Let us see examples of reading Avro files with a schema and without a schema.
Avro file using schema:
import java.util.Properties
import java.io.InputStream
import java.io.ByteArrayInputStream
import com.boomi.execution.ExecutionUtil
import org.apache.avro.Schema
import org.apache.avro.file.DataFileStream
import org.apache.avro.generic.GenericDatumReader
import org.apache.avro.generic.GenericRecord
import org.apache.avro.io.DatumReader

logger = ExecutionUtil.getBaseLogger()

for (int j = 0; j < dataContext.getDataCount(); j++) {
    InputStream istm = dataContext.getStream(j)
    Properties prop = dataContext.getProperties(j)

    // No schema is passed to the reader; it is taken from the file itself.
    DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>()
    DataFileStream<GenericRecord> dataFileStream = new DataFileStream<GenericRecord>(istm, datumReader)
    Schema sche = dataFileStream.getSchema()
    logger.info("Schema used: " + sche)

    // Store each Avro record as a separate JSON document.
    GenericRecord rec = null
    while (dataFileStream.hasNext()) {
        rec = dataFileStream.next(rec)
        System.out.println(rec)
        InputStream jsonStream = new ByteArrayInputStream(rec.toString().getBytes('UTF-8'))
        dataContext.storeStream(jsonStream, prop)
    }
}
The script above reads an Avro file that carries its own schema and produces one JSON document per record. It imports the required packages, obtains the schema from the file through the DataFileStream object, and writes each record out as JSON.
Avro file without a schema:
import java.util.Properties
import java.io.InputStream
import java.io.ByteArrayInputStream
import com.boomi.execution.ExecutionUtil
import org.apache.avro.Schema
import org.apache.avro.file.DataFileStream
import org.apache.avro.generic.GenericDatumReader
import org.apache.avro.generic.GenericRecord
import org.apache.avro.io.DatumReader

logger = ExecutionUtil.getBaseLogger()

// Schema supplied by the script rather than taken from the file.
String schemaString = '{"type":"record","name":"college","namespace":"student.avro",' +
    '"fields":[{"name":"title","type":"string","doc":"college title"},' +
    '{"name":"exam_date","type":"string","doc":"start date"},' +
    '{"name":"teacher","type":"int","doc":"main character is the teacher in college"}]}'

for (int k = 0; k < dataContext.getDataCount(); k++) {
    InputStream istm = dataContext.getStream(k)
    Properties prop = dataContext.getProperties(k)

    // Parse the schema and hand it to the reader as the expected (reader) schema.
    Schema sche = new Schema.Parser().parse(schemaString)
    logger.info("Schema used: " + sche)
    DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>(sche)
    DataFileStream<GenericRecord> dataFileStream = new DataFileStream<GenericRecord>(istm, datumReader)

    // Store each Avro record as a separate JSON document.
    GenericRecord rec = null
    while (dataFileStream.hasNext()) {
        rec = dataFileStream.next(rec)
        System.out.println(rec)
        InputStream jsonStream = new ByteArrayInputStream(rec.toString().getBytes('UTF-8'))
        dataContext.storeStream(jsonStream, prop)
    }
}
The example above shows how to read a file when the schema is supplied separately. If the schema is not included with the Avro data, we have to tell the interpreter how to interpret the binary Avro data, so we also need to provide the schema that was used to write it. In this example the schema is declared inline under the name "college"; a schema with a different name can be used in the same way, and it can also be stored and loaded from another path.
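Why the schema must be supplied becomes clear when looking at the raw bytes: a binary-encoded Avro record contains only field values, in schema order, with no names or type tags. The Python sketch below encodes and decodes one record using a schema shaped like the "college" schema from the script. It is a simplified illustration that handles only string, int, and long fields, and the helper names are hypothetical.

```python
import io
import json

# Schema known to both writer and reader; it is not stored in the payload.
SCHEMA = json.loads(
    '{"type":"record","name":"college","namespace":"student.avro",'
    '"fields":[{"name":"title","type":"string"},'
    '{"name":"exam_date","type":"string"},'
    '{"name":"teacher","type":"int"}]}'
)


def write_long(n: int) -> bytes:
    """Zig-zag varint encoding used by Avro for int and long."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def read_long(stream) -> int:
    """Inverse of write_long, reading from a binary stream."""
    shift = 0
    acc = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of data")
        acc |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1)


def encode_record(schema: dict, record: dict) -> bytes:
    """Write field values back to back, in the order the schema lists them."""
    out = bytearray()
    for field in schema["fields"]:
        value = record[field["name"]]
        if field["type"] == "string":
            data = value.encode("utf-8")
            out += write_long(len(data)) + data
        elif field["type"] in ("int", "long"):
            out += write_long(value)
        else:
            raise NotImplementedError(field["type"])
    return bytes(out)


def decode_record(schema: dict, payload: bytes) -> dict:
    """Walk the schema to decide how to interpret each run of bytes."""
    stream = io.BytesIO(payload)
    record = {}
    for field in schema["fields"]:
        if field["type"] == "string":
            length = read_long(stream)
            record[field["name"]] = stream.read(length).decode("utf-8")
        elif field["type"] in ("int", "long"):
            record[field["name"]] = read_long(stream)
        else:
            raise NotImplementedError(field["type"])
    return record
```

Decoding the same bytes with a schema that lists the fields in a different order or with different types would produce wrong values or fail, which is why the reader needs the writer's schema.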
In this article, we have seen that an Avro file is a data file produced by the data serialization system used with Apache Hadoop. Avro is open source; we have also looked at the configuration of Avro data files and at examples that help in understanding the concept.