Velero 系列文章（四）：使用 Velero 进行生产迁移实战

本文最后更新于：2024年7月25日下午

Velero

概述

目的

通过 velero 工具，实现以下整体目标:

特定 namespace 在 B A 两个集群间做迁移；

具体目标为:

在 B A 集群上创建 velero (包括 restic )
备份 B 集群 特定 namespace : caseycui2020:
1. 备份 resources - 如 deployments, configmaps 等；
  1. 备份前，排除特定 secrets 的 yaml.
2. 备份 volume 数据；(通过 restic 实现)
  1. 通过 "选择性启用" 的方式，只备份特定的 pod volume
迁移特定 namespace 到 A 集群 : caseycui2020:
1. 迁移 resources - 通过 include 的方式，仅迁移特定 resources;
2. 迁移 volume 数据. (通过 restic 实现)

安装

在您的本地目录中创建特定于 Velero 的凭证文件（credentials-velero):

使用的是 xsky 的对象存储: (公司的 netapp 的对象存储不兼容)
1
2
3
[default] aws_access_key_id = xxxxxxxxxxxxxxxxxxxxxxxx aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
(openshift) 需要先创建 namespace : velero: oc new-project velero
默认情况下，用户维度的 openshift namespace 不会在集群中的所有节点上调度 Pod。

要在所有节点上计划 namespace，需要一个注释：
1
oc annotate namespace velero openshift.io/node-selector=""
这应该在安装 velero 之前完成。

启动服务器和存储服务。在 Velero 目录中，运行：

velero install \
    --provider aws \
    --plugins velero/velero-plugin-for-aws:v1.0.0 \
    --bucket velero \
    --secret-file ./credentials-velero \
    --use-restic \
    --use-volume-snapshots=true \
    --backup-location-config region="default",s3ForcePathStyle="true",s3Url="http://glacier.ewhisper.cn",insecureSkipTLSVerify="true",signatureVersion="4" \
    --snapshot-location-config region="default"

创建的内容包括:

CustomResourceDefinition/backups.velero.io: attempting to create resource
CustomResourceDefinition/backups.velero.io: created
CustomResourceDefinition/backupstoragelocations.velero.io: attempting to create resource
CustomResourceDefinition/backupstoragelocations.velero.io: created
CustomResourceDefinition/deletebackuprequests.velero.io: attempting to create resource
CustomResourceDefinition/deletebackuprequests.velero.io: created
CustomResourceDefinition/downloadrequests.velero.io: attempting to create resource
CustomResourceDefinition/downloadrequests.velero.io: created
CustomResourceDefinition/podvolumebackups.velero.io: attempting to create resource
CustomResourceDefinition/podvolumebackups.velero.io: created
CustomResourceDefinition/podvolumerestores.velero.io: attempting to create resource
CustomResourceDefinition/podvolumerestores.velero.io: created
CustomResourceDefinition/resticrepositories.velero.io: attempting to create resource
CustomResourceDefinition/resticrepositories.velero.io: created
CustomResourceDefinition/restores.velero.io: attempting to create resource
CustomResourceDefinition/restores.velero.io: created
CustomResourceDefinition/schedules.velero.io: attempting to create resource
CustomResourceDefinition/schedules.velero.io: created
CustomResourceDefinition/serverstatusrequests.velero.io: attempting to create resource
CustomResourceDefinition/serverstatusrequests.velero.io: created
CustomResourceDefinition/volumesnapshotlocations.velero.io: attempting to create resource
CustomResourceDefinition/volumesnapshotlocations.velero.io: created
Waiting for resources to be ready in cluster...
Namespace/velero: attempting to create resource
Namespace/velero: created
ClusterRoleBinding/velero: attempting to create resource
ClusterRoleBinding/velero: created
ServiceAccount/velero: attempting to create resource
ServiceAccount/velero: created
Secret/cloud-credentials: attempting to create resource
Secret/cloud-credentials: created
BackupStorageLocation/default: attempting to create resource
BackupStorageLocation/default: created
VolumeSnapshotLocation/default: attempting to create resource
VolumeSnapshotLocation/default: created
Deployment/velero: attempting to create resource
Deployment/velero: created
DaemonSet/restic: attempting to create resource
DaemonSet/restic: created
Velero is installed! ⛵ Use 'kubectl logs deployment/velero -n velero' to view the status.

(openshift) 将 velero ServiceAccount 添加到 privilegedSCC：

1	`$ oc adm policy add-scc-to-user privileged -z velero -n velero`

(openshift) 对于 OpenShift 版本 > = 4.1，修改 DaemonSet yaml 以请求 privileged 模式：

@@ -67,3 +67,5 @@ spec:
              value: /credentials/cloud
            - name: VELERO_SCRATCH_DIR
              value: /scratch
+          securityContext:
+            privileged: true

或:

oc patch ds/restic \
  --namespace velero \
  --type json \
  -p '[{"op":"add","path":"/spec/template/spec/containers/0/securityContext","value": { "privileged": true}}]'

备份 - B 集群

备份集群级别的特定资源

1	`velero backup create <backup-name> --include-cluster-resources=true --include-resources deployments,configmaps`

查看备份

1	`velero backup describe YOUR_BACKUP_NAME`

备份特定 namespace `caseycui2020`

排除特定资源

标签为 velero.io/exclude-from-backup=true 的资源不包括在备份中，即使它包含匹配的选择器标签也是如此。

通过这种方式，不需要备份的 secret 等资源通过 velero.io/exclude-from-backup=true 标签 (label) 进行排除.

通过这种方式排除的 secret 部分示例如下:

1
2
3

builder-dockercfg-jbnzr
default-token-lshh8
pipeline-token-xt645

使用 restic 备份 Pod Volume

🐾 注意:

该 namespace 下，以下 2 个 pod volume 也需要备份，但是目前还没正式使用:

mycoreapphttptask-callback

mycoreapphttptaskservice-callback

通过 “选择性启用” 的方式进行有选择地备份.

对包含要备份的卷的每个 Pod 运行以下命令：

1
2

oc -n caseycui2020 annotate pod/<mybackendapp-pod-name> backup.velero.io/backup-volumes=jmx-exporter-agent,pinpoint-agent,my-mybackendapp-claim
oc -n caseycui2020 annotate pod/<elitegetrecservice-pod-name> backup.velero.io/backup-volumes=uploadfile

其中，卷名是容器 spec 中卷的名称。

例如，对于以下 pod：

apiVersion: v1
kind: Pod
metadata:
  name: sample
  namespace: foo
spec:
  containers:
  - image: k8s.gcr.io/test-webserver
    name: test-webserver
    volumeMounts:
    - name: pvc-volume
      mountPath: /volume-1
    - name: emptydir-volume
      mountPath: /volume-2
  volumes:
  - name: pvc-volume
    persistentVolumeClaim:
      claimName: test-volume-claim
  - name: emptydir-volume
    emptyDir: {}

你应该运行:

1	`kubectl -n foo annotate pod/sample backup.velero.io/backup-volumes=pvc-volume,emptydir-volume`

如果您使用控制器来管理您的 pods，则也可以在 pod template spec 中提供此批注。

备份及验证

备份 namespace 及其对象，以及具有相关 annotation 的 pod volume:

1 2	`# 生产 namespace velero backup create caseycui2020 --include-namespaces caseycui2020`

查看备份

1
2
3

velero backup describe YOUR_BACKUP_NAME
velero backup logs caseycui2020
oc -n velero get podvolumebackups -l velero.io/backup-name=caseycui2020 -o yaml

describe 查看的结果如下:

Name:         caseycui2020
Namespace:    velero
Labels:       velero.io/storage-location=default
Annotations:  velero.io/source-cluster-k8s-gitversion=v1.18.3+2cf11e2
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=18+

Phase:  Completed

Errors:    0
Warnimybackendapp:  0

Namespaces:
  Included:  caseycui2020
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Storage Location:  default

Velero-Native Snapshot PVs:  auto

TTL:  720h0m0s

Hooks:  <none>

Backup Format Version:  1.1.0

Started:    2020-10-21 09:28:16 +0800 CST
Completed:  2020-10-21 09:29:17 +0800 CST

Expiration:  2020-11-20 09:28:16 +0800 CST

Total items to be backed up:  591
Items backed up:              591

Velero-Native Snapshots: <none included>

Restic Backups (specify --details for more information):
  Completed:  3

定期备份

使用基于 cron 表达式创建定期计划的备份：

1	`velero schedule create caseycui2020-b-daily --schedule="0 3 * * *" --include-namespaces caseycui2020`

另外，您可以使用一些非标准的速记 cron 表达式：

1	`velero schedule create test-daily --schedule="@every 24h" --include-namespaces caseycui2020`

有关更多用法示例，请参见 cron 软件包的文档。

集群迁移 - 到 A 集群

使用 Backups 和 Restores

只要您将每个 Velero 实例指向相同的云对象存储位置，Velero 就能帮助您将资源从一个群集移植到另一个群集。此方案假定您的群集由同一云提供商托管。请注意，Velero 本身不支持跨云提供程序迁移持久卷快照。如果要在云平台之间迁移卷数据，请启用 restic，它将在文件系统级别备份卷内容。

（集群 B）假设您尚未使用 Velero schedule 操作对数据进行检查点检查，则需要首先备份整个群集（根据需要替换 <BACKUP-NAME>）：
1
velero backup create <BACKUP-NAME>
默认备份保留期限以 TTL（有效期）表示，为 30 天（720 小时）；您可以使用 --ttl <DURATION> 标志根据需要进行更改。有关备份到期的更多信息，请参见 velero 的工作原理。

（集群 A）配置 BackupStorageLocations 和 VolumeSnapshotLocations, 指向 集群 1 使用的位置，使用 velero backup-location create 和 velero snapshot-location create. 要确保配置 BackupStorageLocations 为 read-only, 通过在 velero backup-location create 时使用 --access-mode=ReadOnly flag (因为我就只有一个 bucket, 所以就没有配置只读这一项). 如下为在 A 集群安装，安装时已配置了 BackupStorageLocations 和 VolumeSnapshotLocations.

velero install \
    --provider aws \
    --plugins velero/velero-plugin-for-aws:v1.0.0 \
    --bucket velero \
    --secret-file ./credentials-velero \
    --use-restic \
    --use-volume-snapshots=true \
    --backup-location-config region="default",s3ForcePathStyle="true",s3Url="http://glacier.ewhisper.cn",insecureSkipTLSVerify="true",signatureVersion="4"\
    --snapshot-location-config region="default"

（集群 A）确保已创建 Velero Backup 对象。 Velero 资源与云存储中的备份文件同步。
1
velero backup describe <BACKUP-NAME>
注意：默认同步间隔为 1 分钟，因此请确保在检查之前等待。您可以使用 Velero 服务器的 --backup-sync-period 标志配置此间隔。

（集群 A）一旦确认现在存在正确的备份（<BACKUP-NAME>），就可以使用以下方法还原所有内容： (因为 backup 中只有 caseycui2020 一个 namespace , 所以还原是就不需要 --include-namespaces caseycui2020 进行过滤)

velero restore create --from-backup caseycui2020 --include-resources buildconfigs.build.openshift.io,configmaps,deploymentconfigs.apps.openshift.io,imagestreams.image.openshift.io,imagestreamtags.image.openshift.io,imagetags.image.openshift.io,limitranges,namespaces,networkpolicies.networking.k8s.io,persistentvolumeclaims,prometheusrules.monitoring.coreos.com,resourcequotas,rolebindimybackendapp.authorization.openshift.io,rolebindimybackendapp.rbac.authorization.k8s.io,routes.route.openshift.io,secrets,servicemonitors.monitoring.coreos.com,services,templateinstances.template.openshift.io

因为后面验证 persistentvolumeclaims 的 restore 有问题，所以后续使用的时候先拿掉这个 pvc, 后面再想办法解决:

velero restore create --from-backup caseycui2020 --include-resources buildconfigs.build.openshift.io,configmaps,deploymentconfigs.apps.openshift.io,imagestreams.image.openshift.io,imagestreamtags.image.openshift.io,imagetags.image.openshift.io,limitranges,namespaces,networkpolicies.networking.k8s.io,persistentvolumeclaims,prometheusrules.monitoring.coreos.com,resourcequotas,rolebindimybackendapp.authorization.openshift.io,rolebindimybackendapp.rbac.authorization.k8s.io,routes.route.openshift.io,secrets,servicemonitors.monitoring.coreos.com,services,templateinstances.template.openshift.io

验证 2 个集群

检查第二个群集是否按预期运行：

(集群 A) 运行:

1	`velero restore get`

结果如下:

NAME                       BACKUP      STATUS            STARTED   COMPLETED   ERRORS   WARNImybackendapp   CREATED                         SELECTOR
caseycui2020-20201021102342   caseycui2020   Failed            <nil>     <nil>       0        0          2020-10-21 10:24:14 +0800 CST   <none>
caseycui2020-20201021103040   caseycui2020   PartiallyFailed   <nil>     <nil>       46       34         2020-10-21 10:31:12 +0800 CST   <none>
caseycui2020-20201021105848   caseycui2020   InProgress        <nil>     <nil>       0        0          2020-10-21 10:59:20 +0800 CST   <none>

然后运行:

1 2	`velero restore describe <RESTORE-NAME-FROM-GET-COMMAND> oc -n velero get podvolumerestores -l velero.io/restore-name=YOUR_RESTORE_NAME -o yaml`

结果如下:

Name:         caseycui2020-20201021102342
Namespace:    velero
Labels:       <none>
Annotations:  <none>

Phase:  InProgress

Started:    <n/a>
Completed:  <n/a>

Backup:  caseycui2020

Namespaces:
  Included:  all namespaces found in the backup
  Excluded:  <none>

Resources:
  Included:        buildconfigs.build.openshift.io, configmaps, deploymentconfigs.apps.openshift.io, imagestreams.image.openshift.io, imagestreamtags.image.openshift.io, imagetags.image.openshift.io, limitranges, namespaces, networkpolicies.networking.k8s.io, persistentvolumeclaims, prometheusrules.monitoring.coreos.com, resourcequotas, rolebindimybackendapp.authorization.openshift.io, rolebindimybackendapp.rbac.authorization.k8s.io, routes.route.openshift.io, secrets, servicemonitors.monitoring.coreos.com, services, templateinstances.template.openshift.io
  Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io
  Cluster-scoped:  auto

Namespace mappimybackendapp:  <none>

Label selector:  <none>

Restore PVs:  auto

如果遇到问题，请确保 Velero 在两个群集中的相同 namespace 中运行。

我这边碰到问题，就是 openshift 里边，使用了 imagestream 和 imagetag, 然后对应的镜像拉不过来，容器没有启动.

容器没有启动，podvolume 也没有恢复成功.

Name:         caseycui2020-20201021110424
Namespace:    velero
Labels:       <none>
Annotations:  <none>

Phase:  PartiallyFailed (run 'velero restore logs caseycui2020-20201021110424' for more information)

Started:    <n/a>
Completed:  <n/a>

Warnimybackendapp:
  Velero:     <none>
  Cluster:    <none>
  Namespaces:
    caseycui2020:  could not restore, imagetags.image.openshift.io "mybackendapp:1.0.0" already exists. Warning: the in-cluster version is different than the backed-up version.
                could not restore, imagetags.image.openshift.io "mybackendappno:1.0.0" already exists. Warning: the in-cluster version is different than the backed-up version.
                ...

Errors:
  Velero:     <none>
  Cluster:    <none>
  Namespaces:
    caseycui2020:  error restoring imagestreams.image.openshift.io/caseycui2020/mybackendapp: ImageStream.image.openshift.io "mybackendapp" is invalid: []: Internal error: imagestreams "mybackendapp" is invalid: spec.tags[latest].from.name: Invalid value: "mybackendapp@sha256:6c5ab553a97c74ad602d2427a326124621c163676df91f7040b035fa64b533c7": error generating tag event: imagestreamimage.image.openshift.io ......

Backup:  caseycui2020

Namespaces:
  Included:  all namespaces found in the backup
  Excluded:  <none>

Resources:
  Included:        buildconfigs.build.openshift.io, configmaps, deploymentconfigs.apps.openshift.io, imagestreams.image.openshift.io, imagestreamtags.image.openshift.io, imagetags.image.openshift.io, limitranges, namespaces, networkpolicies.networking.k8s.io, persistentvolumeclaims, prometheusrules.monitoring.coreos.com, resourcequotas, rolebindimybackendapp.authorization.openshift.io, rolebindimybackendapp.rbac.authorization.k8s.io, routes.route.openshift.io, secrets, servicemonitors.monitoring.coreos.com, services, templateinstances.template.openshift.io
  Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io
  Cluster-scoped:  auto

Namespace mappimybackendapp:  <none>

Label selector:  <none>

Restore PVs:  auto

迁移问题总结

目前总结问题如下:

imagestreams.image.openshift.io, imagestreamtags.image.openshift.io, imagetags.image.openshift.io 里的镜像没有成功导入；确切地说是 latest 这个 tag 没有导入成功. imagestreamtags.image.openshift.io 生效也需要时间.
persistentvolumeclaims 迁移过来后报错，报错如下:
1
phase: Lost
原因是: A B 集群的 StorageClass 的配置是不同的，所以 B 集群的 PVC, 在 A 集群想要直接 binding 是不可能的。而且创建后无法直接修改，需要删掉重新创建.
Routes 域名，有部分域名是特定于 A B 集群的域名，如: jenkins-caseycui2020.b.caas.ewhisper.cn 迁移到 A 集群调整为: jenkins-caseycui2020.a.caas.ewhisper.cn
podVolume 数据没有迁移.

`latest` 这个 tag 没有导入成功

手动导入，命令如下: (1.0.1 为 ImageStream 的最新的版本)

1	`oc tag xxl-job-admin:1.0.1 xxl-job-admin:latest`

PVC phase Lost 问题

如果手动创建，需要对 PVC yaml 进行调整。调整前后 PVC 如下:

B 集群原 YAML:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: 'yes'
    pv.kubernetes.io/bound-by-controller: 'yes'
    volume.beta.kubernetes.io/storage-provisioner: csi.trident.netapp.io
  selfLink: /api/v1/namespaces/caseycui2020/persistentvolumeclaims/jenkins
  resourceVersion: '77304786'
  name: jenkins
  uid: ffcabc42-845d-4cdf-8c7c-56e97cb5ea82
  creationTimestamp: '2020-10-21T03:05:46Z'
  managedFields:
    - manager: kube-controller-manager
      operation: Update
      apiVersion: v1
      time: '2020-10-21T03:05:46Z'
      fieldsType: FieldsV1
      fieldsV1:
        'f:status':
          'f:phase': {}
    - manager: velero-server
      operation: Update
      apiVersion: v1
      time: '2020-10-21T03:05:46Z'
      fieldsType: FieldsV1
      fieldsV1:
        'f:metadata':
          'f:annotations':
            .: {}
            'f:pv.kubernetes.io/bind-completed': {}
            'f:pv.kubernetes.io/bound-by-controller': {}
            'f:volume.beta.kubernetes.io/storage-provisioner': {}
          'f:labels':
            .: {}
            'f:app': {}
            'f:template': {}
            'f:template.openshift.io/template-instance-owner': {}
            'f:velero.io/backup-name': {}
            'f:velero.io/restore-name': {}
        'f:spec':
          'f:accessModes': {}
          'f:resources':
            'f:requests':
              .: {}
              'f:storage': {}
          'f:storageClassName': {}
          'f:volumeMode': {}
          'f:volumeName': {}
  namespace: caseycui2020
  finalizers:
    - kubernetes.io/pvc-protection
  labels:
    app: jenkins-persistent
    template: jenkins-persistent-monitored
    template.openshift.io/template-instance-owner: 5a0b28c3-c760-451b-b92f-a781406d9e91
    velero.io/backup-name: caseycui2020
    velero.io/restore-name: caseycui2020-20201021110424
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  volumeName: pvc-414efafd-8b22-48da-8c20-6025a8e671ca
  storageClassName: nas-data
  volumeMode: Filesystem
status:
  phase: Lost

调整后:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: jenkins
  namespace: caseycui2020
  labels:
    app: jenkins-persistent
    template: jenkins-persistent-monitored
    template.openshift.io/template-instance-owner: 5a0b28c3-c760-451b-b92f-a781406d9e91
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  storageClassName: nas-data
  volumeMode: Filesystem

`podVolume` 数据没有迁移

可以手动迁移，命令如下:

# 登录B集群
# 先把B 集群/opt/prometheus数据拿出来到当前文件夹
oc rsync xxl-job-admin-5-9sgf7:/opt/prometheus .
# 上边rsync命令会创建个prometheus的目录
cd prometheus
# 登录A集群
# 再把数据拷贝进去(拷贝之前得先确保这个pod启动起来) (可以先把`JAVA_OPTS`删掉)
oc rsync ./ xxl-job-admin-2-6k8df:/opt/prometheus/

总结

本文写的比较早，后面 OpenShift 出了基于 Velero 封装的 OpenShift 专有的迁移工具，可以直接通过它提供的工具进行迁移.

另外，OpenShift 集群上限制很多，另外也有很多专属于 OpenShift 的资源，所以在实际使用上和标准 K8S 的差别还是比较大的，需要仔细注意.

本次虽然尝试失败，但是其中的思路还是可供借鉴的.

系列文章

Velero 系列文章

📚️参考文档

Velero Docs - Resource filtering

CloudNative

#K8S #最佳实践 #存储 #对象存储 #Velero #备份 #还原 #迁移 #CSI

Velero 系列文章（四）：使用 Velero 进行生产迁移实战

https://ewhisper.cn/posts/49932/

作者

东风微鸣

发布于

2022年5月25日

许可协议

Velero 系列文章（五）：基于 Velero 的 Kubernetes 集群备份容灾生产最佳实践上一篇

Velero 系列文章（三）：Velero 资源过滤下一篇

Velero 系列文章（四）：使用 Velero 进行生产迁移实战

概述

目的

安装

备份 - B 集群

备份集群级别的特定资源

备份特定 namespace caseycui2020

排除特定资源

使用 restic 备份 Pod Volume

备份及验证

定期备份

集群迁移 - 到 A 集群

验证 2 个集群

迁移问题总结

latest 这个 tag 没有导入成功

PVC phase Lost 问题

podVolume 数据没有迁移

总结

系列文章

📚️参考文档

备份特定 namespace `caseycui2020`

`latest` 这个 tag 没有导入成功

`podVolume` 数据没有迁移