What is the ORC data storage format

Hive file formats in detail

tags: Hadoop Hive

File formats in Hive:

  • SequenceFile: rarely used in production; a binary key-value (K-V) format
  • TextFile: row storage; widely used in production
  • RCFile: rarely used in production; columnar storage; ORC is its improved successor
  • ORC: one of the most commonly used formats in production; columnar storage
  • Parquet: one of the most commonly used formats in production; columnar storage
  • Avro: almost never used in production; not considered here
  • JsonFile: almost never used; not considered here
  • InputFormat (custom): almost never used in production; not considered here

[NOTE] Hive's default format is TextFile; it can be changed with set hive.default.fileformat.
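For instance, a minimal sketch of changing the default for a session (the table name t_demo is made up for illustration):

  • set hive.default.fileformat=ORC;
  • -- any table created now without an explicit STORED AS clause uses ORC
  • create table t_demo (id int, name string);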

 

[Row storage and column storage]

  • Row storage and column storage describe how data is physically laid out on disk at the storage layer.

[Conclusions from the above]

  1. Row storage keeps all fields of the same row together in the same block. A select query therefore always reads every field; a single column cannot be read on its own.
  2. Column storage keeps all values of the same column together in the same block. In other words, different columns can sit in different blocks, so a select query can read a single column separately.

 

[Pros and cons]

Column storage:

  1. Pros: when a query touches only one field or a few, only the blocks holding those fields need to be read, which greatly reduces the amount of data scanned and improves query efficiency.
  2. Cons: a query over all fields must reassemble rows from many blocks, which is slower than reading row-stored data.

 

Row storage:

  1. Pros: a query over all fields is relatively fast.
  2. Cons: a query over only a few fields still reads every field, lowering query efficiency and wasting resources unnecessarily; in practice, full-field queries are a rare scenario. The sketch below makes the tradeoff concrete.
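As a sketch of the tradeoff, using the g6_access table created in the next section: in a columnar format such as ORC, the first query reads only two columns' blocks, while the second must touch every column.

  • -- columnar storage reads only the domain and traffic blocks
  • select domain, sum(traffic) from g6_access group by domain;
  • -- a full-field query touches every column; row storage is at no disadvantage here
  • select * from g6_access limit 10;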

 

 

[Hive file format configuration and comparison]

  • Create the original table in the default TextFile format:
  • CREATE EXTERNAL TABLE g6_access (
  • cdn string,
  • region string,
  • level string,
  • time string,
  • ip string,
  • domain string,
  • url string,
  • traffic bigint)
  • ROW FORMAT DELIMITED
  • FIELDS TERMINATED BY '\t'
  • LOCATION '/g6/hadoop/access/clear/test/';

 

 

  • Check the size of the data on HDFS: 64.9 MB
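One way, among others, to check this size from inside the Hive CLI (the path matches the LOCATION above) is the dfs command:

  • dfs -du -h /g6/hadoop/access/clear/test/;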

 

  • Create table g6_access_seq stored in SequenceFile format, populated with the data from g6_access:
  • create table g6_access_seq
  • stored as SEQUENCEFILE
  • as select * from g6_access;
  • Show data size: 71.8 MB

  • Conclusion: the file is larger than in the default TextFile format, so SequenceFile is generally not used in production.
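To double-check which format a table actually uses, desc formatted prints the table's InputFormat; a quick sketch:

  • desc formatted g6_access_seq;
  • -- the InputFormat line should show org.apache.hadoop.mapred.SequenceFileInputFormat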

 

  • Create an RCFile table, populated with the data from g6_access:
  • create table g6_access_rc
  • stored as RCFILE
  • as select * from g6_access;
  • Show data size: 61.6 MB

  • Conclusion: the size drops by only about 3 MB, an insignificant saving with no real advantage; this is why RCFile is not used in production.

 

  • Create an ORC table, populated with the data from g6_access; ORC compresses with ZLIB by default:
  • create table g6_access_orc
  • stored as ORC
  • as select * from g6_access;

 

  • Show data size: 17.0 MB
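The default above is equivalent to asking for ZLIB explicitly; a sketch that just makes the property visible (the table name g6_access_orc_zlib is hypothetical):

  • create table g6_access_orc_zlib
  • stored as ORC tblproperties ("orc.compress" = "ZLIB")
  • as select * from g6_access;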

 

 

 

  • Create an ORC table without compression, populated with the data from g6_access:
  • create table g6_access_orc_none
  • stored as ORC tblproperties ("orc.compress" = "NONE")
  • as select * from g6_access;
  • Show data size: 51.5 MB
  • Conclusion: even without compression, the ORC file is more than 10 MB smaller than the source file; with the default compression, it shrinks to only about a quarter of the source file's size.

 

  • Create a Parquet table without compression, populated with the data from g6_access:
  • create table g6_access_par
  • stored as PARQUET
  • as select * from g6_access;

 

  • Show data size: 58.3 MB

 

  • Create a Parquet table with gzip compression, populated with the data from g6_access:
  • set parquet.compression=gzip;
  • create table g6_access_par_zip
  • stored as PARQUET
  • as select * from g6_access;

 

  • Conclusion: with gzip compression, the Parquet file is also about a quarter of the source file's size, making Parquet another good choice for production.
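SNAPPY is another compression codec often chosen for Parquet in production, trading a somewhat larger file for cheaper CPU; a sketch (the table name g6_access_par_snappy is hypothetical):

  • set parquet.compression=SNAPPY;
  • create table g6_access_par_snappy
  • stored as PARQUET
  • as select * from g6_access;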

 

[Amount of data read]

  • Run a select query directly and watch the job log; the counter "HDFS Read: 190xxx" at the end shows how much data was actually read.
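As a hedged sketch of the comparison, run the same aggregation against the TextFile and ORC tables and compare the HDFS Read counter each job reports:

  • -- TextFile: HDFS Read is close to the full 64.9 MB file
  • select sum(traffic) from g6_access;
  • -- ORC: HDFS Read covers little more than the traffic column's stripes
  • select sum(traffic) from g6_access_orc;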
