I do recommend the Parquet format for storing data. However, for completeness, I need to at least mention other serialization formats: some of them, such as Kryo, will be used implicitly for you during Spark computations without your knowledge, and there is, of course, the default Java serialization.
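To make the recommendation concrete, here is a minimal sketch of writing and reading Parquet with the standard Spark DataFrame API (assuming the Spark 2.x SparkSession entry point; the Person case class and the /tmp path are illustrative, not from any particular dataset):

```scala
import org.apache.spark.sql.SparkSession

// An illustrative record type; any case class with supported field types works
case class Person(name: String, age: Int)

object ParquetExample extends App {
  val spark = SparkSession.builder()
    .appName("parquet-example")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  val df = Seq(Person("Ann", 34), Person("Bob", 29)).toDF()

  // Write the data in the Parquet columnar format
  df.write.mode("overwrite").parquet("/tmp/people.parquet")

  // Read it back; Parquet stores the schema with the file
  val restored = spark.read.parquet("/tmp/people.parquet")
  restored.show()

  spark.stop()
}
```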
Object-oriented approach versus functional approach
Objects in the object-oriented approach are characterized by state and behavior, and they are the cornerstone of object-oriented programming. A class is a template for objects, with fields that represent the state and methods that may represent the behavior; an abstract method's implementation may depend on the instance of the class. In the functional approach, state is usually frowned upon: in pure functional languages, there should be no state and no side effects, and every invocation should return the same result. Behaviors may be expressed through additional function parameters and higher-order functions (functions over functions, as in currying), but they should be explicit, unlike abstract methods. Since Scala is a mix of object-oriented and functional languages, some of the preceding constraints are violated, but this does not mean that you have to violate them unless absolutely necessary. It is best practice to store code in jar packages and to store data, particularly big data, separately from code in data files (in a serialized form); but again, people often store data/configurations in jar files, and it is less common, but possible, to store code in data files.
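For illustration, here is a minimal Scala sketch of the contrast; the Counter and applyTwice names are mine, chosen purely for this example:

```scala
object StyleContrast extends App {
  // Object-oriented: state lives in the object; behavior depends on it
  class Counter {
    private var count = 0                      // mutable state
    def increment(): Int = { count += 1; count }
  }

  // Functional: no state; behavior is passed in explicitly as a
  // higher-order function, and the same input always gives the same output
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  val c = new Counter
  println(c.increment())         // 1
  println(c.increment())         // 2: the result depends on hidden state
  println(applyTwice(_ + 1, 0))  // always 2: no state, no side effects
}
```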
Serialization has been an issue ever since the need arose to persist data on disk or to transfer objects from one JVM or machine to another over a network. The purpose of serialization is to represent complex nested objects as a series of bytes understandable by machines, and as you can imagine, this may be language-dependent. Luckily, serialization frameworks converge on a set of common data structures they can handle.
One of the most popular, but not the most efficient, serialization mechanisms is to dump an object into an ASCII file: CSV, XML, JSON, YAML, and so on. These formats do work for complex nested data such as structures, arrays, and maps, but they are inefficient from the storage-space perspective. For example, a Double represents a continuous number with 15-17 significant digits, which, barring rounding or trivial ratios, takes 15-17 bytes to represent in US ASCII, while the binary representation takes only 8 bytes. Integers may be stored even more efficiently, particularly small ones, as leading zeroes can be compressed away.
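A quick sketch demonstrating the gap, using math.Pi as the sample value:

```scala
object TextVersusBinary extends App {
  val x = math.Pi  // 3.141592653589793

  // Text form: one US-ASCII byte per character of the decimal expansion
  val ascii = x.toString.getBytes("US-ASCII")
  println(s"ASCII:  ${ascii.length} bytes")   // 17 bytes

  // Binary form: a fixed 8-byte IEEE 754 representation
  val binary = java.nio.ByteBuffer.allocate(8).putDouble(x).array()
  println(s"Binary: ${binary.length} bytes")  // 8 bytes
}
```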
One advantage of text encodings is that they are much easier to visualize with simple command-line tools, but any advanced serialization framework now comes with its own set of tools to work with raw records, such as avro-tools or parquet-tools.
The following table provides an overview of the most common serialization frameworks:
| Serialization Format | When developed | Comments |
|---|---|---|
| XML, JSON, YAML | A direct response to the need to encode nested structures and to exchange data between machines. | While grossly inefficient, these formats are still used in many places, particularly in web services. Their only advantage is that they are relatively easy for humans to read and inspect without machines. |
| Protobuf | Developed by Google in the early 2000s. It implements the Dremel encoding scheme and supports multiple languages (Scala is not officially supported yet, even though some code exists). | The main advantage is that Protobuf can generate native classes in many languages. C++, Java, and Python are officially supported; there are ongoing projects for C, C#, Haskell, Perl, Ruby, Scala, and more. The runtime can call native code to inspect, serialize, and deserialize objects and binary representations. |
| Avro | Developed by Doug Cutting while he was working at Cloudera. The main objective was to separate the encoding from any specific implementation and language, allowing better schema evolution. | While the argument over whether Protobuf or Avro is more efficient is still ongoing, Avro supports a larger set of complex structures out of the box, such as unions and maps, compared to Protobuf. Scala support still needs to be strengthened to production level. Avro files have the schema encoded with every file, which has its pros and cons. |
| Thrift | Apache Thrift was developed at Facebook for the same purpose as Protobuf. It probably has the widest selection of supported languages: C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml, Delphi, and others. Twitter is hard at work on Thrift code generation for Scala (https://twitter.github.io/scrooge/). | Apache Thrift is often described as a framework for cross-language service development and is most frequently used for Remote Procedure Calls (RPC). Even though it can be used directly for serialization/deserialization, other frameworks happen to be more popular. |
| Parquet | Developed in a joint effort between Twitter and Cloudera. Compared to the row-oriented Avro format, Parquet is columnar storage, which gives better compression and performance when only a few columns are selected. The internal encoding is Dremel/Protobuf-based, even though the records are presented as Avro records; thus, it is often called AvroParquet. | Advanced features such as indices, dictionary encoding, and RLE compression can make it very efficient for pure disk storage. Writing the files may be slower, as Parquet requires some preprocessing and index building before they can be committed to disk. |
| Kryo | A framework for encoding arbitrary classes in Java. However, not all built-in Java collection classes can be serialized. | If one avoids the non-serializable classes, such as priority queues, Kryo can be very efficient; a Spark configuration sketch follows this table. Direct support for Scala is also under way. |
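Since Spark already uses Kryo under the hood in places, it is worth showing how to enable it explicitly. The following is a minimal sketch using the standard SparkConf settings; the Point class stands in for your own application classes:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A hypothetical application class to register with Kryo
case class Point(x: Double, y: Double)

object KryoConfig extends App {
  val conf = new SparkConf()
    .setAppName("kryo-example").setMaster("local[*]")
    // Replace the default Java serializer for shuffles and RDD caching
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // Registered classes are encoded as small numeric IDs instead of
    // full class names in every serialized record
    .registerKryoClasses(Array(classOf[Point]))

  val sc = new SparkContext(conf)
  println(sc.getConf.get("spark.serializer"))
  sc.stop()
}
```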
Certainly, Java has a built-in serialization framework, but as it has to support every Java case, it is overly general, and Java serialization is far less efficient than any of the preceding methods. I have certainly seen companies implement their own proprietary serialization in the past, which would beat any of the preceding frameworks for their specific cases. Nowadays, this is no longer necessary, as the maintenance costs definitely overshadow the small and diminishing inefficiency of the existing frameworks.
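To see the overhead for yourself, here is a small sketch serializing a single boxed Double with the built-in Java framework; the exact byte count depends on the JVM version:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object JavaSerializationSize extends App {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  // Serialize one boxed Double with built-in Java serialization
  oos.writeObject(java.lang.Double.valueOf(math.Pi))
  oos.close()
  // The stream carries a header and class metadata, so the payload is
  // far larger than the 8 raw bytes of the IEEE 754 value itself
  println(s"Java serialization: ${bos.size} bytes versus 8 raw bytes")
}
```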