modify picture and doc link references #6

Open · wants to merge 1 commit into master
4 changes: 2 additions & 2 deletions README.md
@@ -1,6 +1,6 @@
#### 🚩 I've been planning to spend some evenings on part-time work, helping people who have just entered this field and fresh graduates revise their resumes and offering interview coaching
#### 🚩 I'm currently a big data development engineer at a top-tier company; anyone interested can reach me on WeChat (VX): DongniVirgo
----
+---

#### Big Data Interview Questions and Answers

@@ -80,7 +80,7 @@

9. [SQL question: group by student subject and take the Top N for each subject](./docs/按照学生科目取每个科目的TopN.md)

-10. [SQL question: get the first 1/4 of each user's records](./docs/获取每个用户的前1/4次的数据.md)
+10. [SQL question: get the first 1/4 of each user's records](./docs/获取每个用户的前14次的数据.md)



2 changes: 1 addition & 1 deletion docs/HDFS架构.md
@@ -37,7 +37,7 @@ Secondary NameNode is not a hot standby for the NameNode; rather, it periodically … from the NameNode

### 2. The HA Implementation in HDFS 2.0

-![HDFS 2.0 architecture](D:\Note\big-data-interview\BigData-Interview\pictures\hdfs-ha.png)
+![HDFS 2.0 architecture](../pictures/hdfs-ha.png)

- **Active NameNode and Standby NameNode**: the two NameNodes back each other up; one is in the Active state and serves as the primary NameNode, while the other is in the Standby state as the standby NameNode. Only the primary NameNode can serve read and write requests;

2 changes: 1 addition & 1 deletion docs/Yarn调度MapReduce.md
@@ -1,6 +1,6 @@
## How Yarn Schedules a MapReduce Job

-![](../picturees/yarn调度mr过程.jpg)
+![](../pictures/yarn调度mr过程.jpg)

1. The MR program is submitted on the node where the client resides (MapReduce).
2. YarnRunner applies to the ResourceManager for an application.
2 changes: 1 addition & 1 deletion docs/flink是如何实现反压的.md
@@ -12,7 +12,7 @@ Flink's backpressure has gone through two stages of development: TCP-based backpressure (<1.5) and

RS and IC use backlog and credit to know ahead of time how much data each side can send and receive, rather than sizing buffers via the TCP sliding window and only then applying backpressure

-![](D:\Note\big-data-interview\BigData-Interview\pictures\flink基于credit的反压.png)
+![](../pictures/flink基于credit的反压.png)
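
To make the backlog/credit handshake concrete, here is a minimal illustrative Python sketch (hypothetical names and flow, not Flink's actual classes or network stack): the receiver grants credit only for buffers it actually has free, the sender announces its backlog so credit can be granted ahead of time, and sending stops the moment credit runs out, which is the backpressure signal.

```python
from collections import deque

class Receiver:
    def __init__(self, free_buffers):
        self.free_buffers = free_buffers  # free local buffers back the credit

    def grant_credit(self, backlog):
        # Grant at most as much credit as there are free buffers; the
        # announced backlog tells the receiver how much demand is queued.
        credit = min(self.free_buffers, backlog)
        self.free_buffers -= credit
        return credit

class Sender:
    def __init__(self, records):
        self.backlog = deque(records)  # buffered data waiting to be sent
        self.credit = 0

    def send_to(self, receiver):
        # Announce the backlog first so credit is granted ahead of time.
        self.credit += receiver.grant_credit(len(self.backlog))
        sent = []
        while self.backlog and self.credit:  # zero credit == backpressure
            sent.append(self.backlog.popleft())
            self.credit -= 1
        return sent

rs = Sender(records=[f"buffer-{i}" for i in range(5)])
ic = Receiver(free_buffers=3)
print(rs.send_to(ic))  # ['buffer-0', 'buffer-1', 'buffer-2']; the rest waits
```

Because the sender halts at zero credit, nothing is pushed into the network only to be blocked there; the slowdown propagates upstream explicitly.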



6 changes: 3 additions & 3 deletions docs/spark的shuffle介绍.md
@@ -4,15 +4,15 @@

- **Hash Shuffle** (removed as of 2.0)

-![](D:\Note\big-data-interview\BigData-Interview\pictures\spark-shuffle-v1.png)
+![](../pictures/spark-shuffle-v1.png)

-In the map phase (shuffle write), every map task writes one temporary file for each partition of the downstream stage. If the downstream stage has 1000 partitions, each map task produces 1000 temporary files; since an executor usually runs several map tasks, a single executor ends up with a very large number of temporary files. With M map tasks on an executor and N downstream partitions, the executor produces M*N files. On top of that, if the executor has K cores it can run K tasks concurrently, so K*N file descriptors are requested at once; as the partition count grows this is bound to exhaust the executor's file descriptors, and creating K*N write handlers also consumes a large amount of memory.
+In the map phase (shuffle write), every map task writes one temporary file for each partition of the downstream stage. If the downstream stage has 1000 partitions, each map task produces 1000 temporary files; since an executor usually runs several map tasks, a single executor ends up with a very large number of temporary files. With M map tasks on an executor and N downstream partitions, the executor produces M\*N files. On top of that, if the executor has K cores it can run K tasks concurrently, so K\*N file descriptors are requested at once; as the partition count grows this is bound to exhaust the executor's file descriptors, and creating K*N write handlers also consumes a large amount of memory.

In the reduce phase (shuffle read), every reduce task pulls its share of partition data from all the map outputs, so the executor opens all the temporary files to prepare for network transfer, which again involves a huge number of file descriptors. Besides, if the reduce side has a combiner operation, the data pulled over the network is merged in a `HashMap`; with a large data volume this easily causes an OOM.
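
The combiner-side merge described above is essentially hash aggregation; a tiny illustrative Python sketch (simplified, not Spark's actual data structures):

```python
# Merge (key, value) records pulled over the network into one in-memory
# map. Keeping everything in a single map, as here, is exactly what
# risks OOM when the pulled data volume is large.
def combine(records, merge=lambda a, b: a + b):
    acc = {}
    for key, value in records:
        acc[key] = merge(acc[key], value) if key in acc else value
    return acc

print(combine([("a", 1), ("b", 2), ("a", 3)]))  # {'a': 4, 'b': 2}
```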

- **Sort Shuffle** (since 1.1; sort shuffle has itself been optimized and upgraded over time, see reference article 1 for details)

-![](D:\Note\big-data-interview\BigData-Interview\pictures\spark-shuffle-v3.png)
+![](../pictures/spark-shuffle-v3.png)

In the map phase (shuffle write), records are sorted by partition id and key, and the data of all partitions is written into a single file: records are laid out partition by partition in partition-id order, and within each partition they are stored sorted by key. While running, a map task writes each partition's data sequentially and records each partition's size and offset in an index file. This way, each map task opens only two file descriptors at a time, one for data and one for the index, which greatly alleviates Hash Shuffle's file-descriptor problem: even if an executor has K cores, at most K*2 file descriptors are open at once.
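
A quick worked example of the descriptor arithmetic from the two sections above, as a small Python sketch (the numbers are invented for illustration):

```python
# Hypothetical executor: M map tasks in total, N downstream partitions,
# K cores (so at most K tasks run at the same time).
M, N, K = 10, 1000, 8

# Hash Shuffle: one temp file per (map task, downstream partition) pair.
hash_files = M * N        # 10000 temp files produced on this executor
hash_open_fds = K * N     # 8000 write handles open concurrently

# Sort Shuffle: one data file plus one index file per map task.
sort_files = M * 2        # 20 files produced on this executor
sort_open_fds = K * 2     # at most 16 descriptors open concurrently

print(hash_files, hash_open_fds)  # 10000 8000
print(sort_files, sort_open_fds)  # 20 16
```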

2 changes: 1 addition & 1 deletion docs/spark的stage是如何划分的.md
@@ -2,5 +2,5 @@

**Stages are divided based on whether a shuffle (i.e., a wide dependency) occurs: each shuffle operation splits the job into a stage before it and a stage after it.**

-![](D:\Note\big-data-interview\BigData-Interview\pictures\stageDivide.jpg)
+![](../pictures/stageDivide.jpg)
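
As a minimal PySpark illustration of this rule (assuming a local SparkContext), the job below splits into exactly two stages at the reduceByKey shuffle:

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "stage-divide-demo")

counts = (sc.parallelize(["a", "b", "a", "c"])
            .map(lambda w: (w, 1))            # narrow dependency: stays in stage 1
            .reduceByKey(lambda a, b: a + b)  # shuffle (wide dependency): starts stage 2
            .collect())

print(counts)  # e.g. [('a', 2), ('b', 1), ('c', 1)]
sc.stop()
```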

2 changes: 1 addition & 1 deletion docs/按照学生科目取每个科目的TopN.md
@@ -14,7 +14,7 @@ id,name,subject,score

**Rank by score within each subject and take the Top 3**

-```
+```sql
select a.* from
(select id,name,subject,score,row_number() over(partition by subject order by score desc) rank from student) a
where a.rank <= 3
4 changes: 2 additions & 2 deletions docs/获取每个用户的前14次的数据.md
@@ -1,6 +1,6 @@
## Hive SQL: get the first 1/4 of each user's records

-```
+```sql
cookieId createTime pv
--------------------------
cookie1 2015-04-10 1
@@ -21,7 +21,7 @@ cookie2 2015-04-16 7

Get the first 1/4 of each user's visit records

-```
+```sql
SELECT a.* from
(SELECT cookieid,createtime,pv,NTILE(4)
OVER(PARTITION BY cookieId ORDER BY createtime) AS rn
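
The diff shows only the head of the query, but the visible part already tells the story: NTILE(4) splits each user's visits, ordered by createtime, into four near-equal buckets, and keeping bucket 1 yields the first 1/4. A minimal Python sketch of NTILE's bucketing semantics (sample dates are made up; this mimics the SQL behavior, not Hive's implementation):

```python
def ntile(n, ordered_rows):
    # Like SQL NTILE: split the ordered rows into n buckets of near-equal
    # size; earlier buckets take one extra row when the count is uneven.
    base, extra = divmod(len(ordered_rows), n)
    buckets, start = [], 0
    for i in range(n):
        end = start + base + (1 if i < extra else 0)
        buckets.append(ordered_rows[start:end])
        start = end
    return buckets

# cookie1's visits, already ordered by createtime (made-up sample)
visits = ["2015-04-10", "2015-04-11", "2015-04-12", "2015-04-13",
          "2015-04-14", "2015-04-15", "2015-04-16"]
print(ntile(4, visits)[0])  # the first 1/4 -> ['2015-04-10', '2015-04-11']
```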
File renamed without changes.