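Sample code for reading and writing Parquet data in Java
Author: Nucky_yang  Date: 2022-09-16 11:09:47
This article shows how to read and write Parquet-format data in Java. The full example is as follows: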
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.log4j.Logger;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.GroupFactory;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetReader.Builder;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.GroupReadSupport;
import org.apache.parquet.hadoop.example.GroupWriteSupport;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;
public class ReadParquet {
    static Logger logger = Logger.getLogger(ReadParquet.class);

    public static void main(String[] args) throws Exception {
        // parquetWriter("test\\parquet-out2", "input.txt");
        parquetReaderV2("test\\parquet-out2");
    }
    static void parquetReaderV2(String inPath) throws Exception {
        GroupReadSupport readSupport = new GroupReadSupport();
        Builder<Group> builder = ParquetReader.builder(readSupport, new Path(inPath));
        // try-with-resources closes the reader even if a read fails
        try (ParquetReader<Group> reader = builder.build()) {
            Group line;
            while ((line = reader.read()) != null) {
                Group time = line.getGroup("time", 0);
                // Fields can be fetched either by index or by field name
                /*System.out.println(line.getString(0, 0) + "\t" +
                        line.getString(1, 0) + "\t" +
                        time.getInteger(0, 0) + "\t" +
                        time.getString(1, 0) + "\t");*/
                System.out.println(line.getString("city", 0) + "\t" +
                        line.getString("ip", 0) + "\t" +
                        time.getInteger("ttl", 0) + "\t" +
                        time.getString("ttl2", 0) + "\t");
                //System.out.println(line.toString());
            }
        }
        System.out.println("read finished");
    }
    // In newer versions all of the new ParquetReader() constructors appear to be deprecated;
    // build the reader with the builder shown above instead
    static void parquetReader(String inPath) throws Exception {
        GroupReadSupport readSupport = new GroupReadSupport();
        try (ParquetReader<Group> reader = new ParquetReader<Group>(new Path(inPath), readSupport)) {
            Group line;
            while ((line = reader.read()) != null) {
                System.out.println(line.toString());
            }
        }
        System.out.println("read finished");
    }
    /**
     * Converts a plain-text file into a Parquet file.
     *
     * @param outPath path of the Parquet file to write
     * @param inPath  path of the plain-text input file
     * @throws IOException if reading the input or writing the output fails
     */
    static void parquetWriter(String outPath, String inPath) throws IOException {
        MessageType schema = MessageTypeParser.parseMessageType("message Pair {\n" +
                " required binary city (UTF8);\n" +
                " required binary ip (UTF8);\n" +
                " repeated group time {\n" +
                " required int32 ttl;\n" +
                " required binary ttl2;\n" +
                "}\n" +
                "}");
        GroupFactory factory = new SimpleGroupFactory(schema);
        Path path = new Path(outPath);
        Configuration configuration = new Configuration();
        GroupWriteSupport writeSupport = new GroupWriteSupport();
        // setSchema is a static method; it stores the schema in the configuration
        GroupWriteSupport.setSchema(schema, configuration);
        Random r = new Random();
        // Read the local text file and turn each line into a Parquet record;
        // try-with-resources closes both the writer and the input file
        try (ParquetWriter<Group> writer = new ParquetWriter<Group>(path, configuration, writeSupport);
             BufferedReader br = new BufferedReader(new FileReader(new File(inPath)))) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] strs = line.split("\\s+");
                if (strs.length == 2) {
                    Group group = factory.newGroup()
                            .append("city", strs[0])
                            .append("ip", strs[1]);
                    Group tmpG = group.addGroup("time");
                    tmpG.append("ttl", r.nextInt(9) + 1);
                    tmpG.append("ttl2", r.nextInt(9) + "_a");
                    writer.write(group);
                }
            }
        }
        System.out.println("write end");
    }
}
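The writer above only accepts input lines that split into exactly two whitespace-separated tokens (city and ip); every other line is silently skipped. The original post does not show input.txt, so the sample lines below are made up, and `InputLineDemo` is a hypothetical helper that only mirrors the split-and-check step. A minimal sketch:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original class: it mirrors the
// split-and-check step that parquetWriter applies to each input line.
class InputLineDemo {

    // Returns {city, ip} when the line has exactly two whitespace-separated
    // tokens, or null when parquetWriter would skip the line.
    static String[] parse(String line) {
        String[] strs = line.split("\\s+");
        return strs.length == 2 ? strs : null;
    }

    public static void main(String[] args) {
        // Made-up sample lines for illustration only.
        List<String> lines = Arrays.asList(
                "beijing 10.0.0.1",     // accepted
                "shanghai\t10.0.0.2",   // accepted: a tab also matches \s+
                "guangzhou",            // skipped: only one token
                " beijing 10.0.0.1");   // skipped: leading space yields an empty first token
        for (String line : lines) {
            String[] fields = parse(line);
            System.out.println(fields == null
                    ? "skip: " + line
                    : "city=" + fields[0] + " ip=" + fields[1]);
        }
    }
}
```

Note that a line with leading whitespace is dropped as well, because `split` then returns an empty string as the first token; trimming the line first would avoid that if it is not intended.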
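A note on the schema: writing Parquet data requires a schema, whereas reading "detects" it automatically, because the schema is stored in the file's own metadata.
/*
 * Every field has three attributes: a repetition, a data type and a field name. The repetition is one of:
 * required (appears exactly once)
 * repeated (appears zero or more times)
 * optional (appears zero or one time)
 * A field's data type falls into one of two kinds:
 * group (complex type)
 * primitive (basic type)
 * The primitive types are
 * INT64, INT32, BOOLEAN, BINARY, FLOAT, DOUBLE, INT96, FIXED_LEN_BYTE_ARRAY
 */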
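One way to picture the three repetition kinds is by the plain-Java shape each one most naturally corresponds to. The sketch below is only an analogy: the class and method names are made up, and nothing in it is Parquet API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Analogy only: these names are made up and none of this is Parquet API.
class RepetitionAnalogy {

    // One "time" entry holding ttl and ttl2, as in the schema of the example.
    static List<Object> timeEntry() {
        return Arrays.<Object>asList(7, "7_a");
    }

    // required: exactly one entry, so the entry itself is the value
    static List<Object> required() {
        return timeEntry();
    }

    // optional: zero or one entry
    static Optional<List<Object>> optional() {
        return Optional.of(timeEntry());
    }

    // repeated: zero or more entries, so the value is a list of entries
    static List<List<Object>> repeated() {
        return Collections.singletonList(timeEntry());
    }

    public static void main(String[] args) {
        System.out.println("required -> " + required());  // required -> [7, 7_a]
        System.out.println("optional -> " + optional());  // optional -> Optional[[7, 7_a]]
        System.out.println("repeated -> " + repeated());  // repeated -> [[7, 7_a]]
    }
}
```

The extra level of nesting in the repeated case is the point: a consumer sees a collection of entries rather than a single entry, even when only one entry was written.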
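repeated and required differ in more than the number of occurrences: the data type produced after serialization is different too. For example, when the group containing ttl2 is declared repeated it prints as WrappedArray([7,7_a]), whereas when it is declared required it prints as [7,7_a].
Besides the MessageTypeParser.parseMessageType method, a MessageType can also be built with the approach below.
(Note a pitfall here, which shows up in Spark: as(OriginalType.UTF8) on ttl2 has the same effect as the (UTF8) in required binary city (UTF8). With UTF8, the field can be converted to StringType on read; without it, reading fails with [B cannot be cast to java.lang.String.)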
/*MessageType schema = MessageTypeParser.parseMessageType("message Pair {\n" +
" required binary city (UTF8);\n" +
" required binary ip (UTF8);\n" +
"repeated group time {\n"+
"required int32 ttl;\n"+
"required binary ttl2;\n"+
"}\n"+
"}");*/
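//import org.apache.parquet.schema.OriginalType;
//import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
//import org.apache.parquet.schema.Types;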
MessageType schema = Types.buildMessage()
.required(PrimitiveTypeName.BINARY).as(OriginalType.UTF8).named("city")
.required(PrimitiveTypeName.BINARY).as(OriginalType.UTF8).named("ip")
.repeatedGroup().required(PrimitiveTypeName.INT32).named("ttl")
.required(PrimitiveTypeName.BINARY).as(OriginalType.UTF8).named("ttl2")
.named("time")
.named("Pair");
To fix the [B cannot be cast to java.lang.String exception, either:
1. add UTF8 when generating the Parquet file, or
2. supply the same schema again at read time so that the field's type is specified.
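The error message itself can be reproduced in plain Java, which may help explain why the UTF8 annotation matters. "[B" is the JVM's name for byte[]: a binary column without the annotation is handed over as raw bytes, and code that expects a String then fails on the cast. The sketch below uses no Parquet or Spark API; `BinaryVsString` is a made-up class name and the value "7_a" is just a sample.

```java
import java.nio.charset.StandardCharsets;

// Plain-Java illustration only; no Parquet or Spark classes are involved.
class BinaryVsString {

    // Decodes raw bytes as UTF-8 text, which is what the UTF8 annotation
    // tells a reader it may do with a binary column.
    static String decode(Object raw) {
        return new String((byte[]) raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A value as it might arrive from a binary column without the annotation.
        Object raw = "7_a".getBytes(StandardCharsets.UTF_8);

        try {
            String s = (String) raw;   // compiles, but fails at run time
            System.out.println(s);
        } catch (ClassCastException e) {
            // The exact wording of the message varies between Java versions.
            System.out.println("cast failed: " + e.getMessage());
        }

        System.out.println("decoded: " + decode(raw));   // decoded: 7_a
    }
}
```

Both fixes listed above amount to the same thing: making sure the reader knows the bytes are UTF-8 text so that it decodes them instead of casting them.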
Maven dependency (I used version 1.7.0):
<dependency>
<groupId>org.apache.parquet</groupId>
<artifactId>parquet-hadoop</artifactId>
<version>1.7.0</version>
</dependency>
Source: http://www.cnblogs.com/yanghaolie/p/7156372.html
Tags: java, parquet