Commit 3b2e71ec authored by lidongxu

New version: completed step one — team data cleaning test

Parent c34ee4a3
__pycache__/
*.py[cod]
*$py.class
.Python
*.so
.venv/
venv/
.env
# Default output directory for team conversion
code/cache/
# Data Cleaning System - environment variable configuration
# ENV selects the environment: unset or ENV=development for development, ENV=production for production
ENV=development
# Server configuration
HOST=0.0.0.0
PORT=8000
DEBUG=False
# ---------- Development database (used when ENV=development) ----------
DB_HOST=192.168.100.39
DB_PORT=25301
DB_USER=root
DB_PASSWORD="Zt%68Dsuv&M"
DB_NAME=market_bi
# ---------- Production database (used when ENV=production) ----------
PROD_DB_HOST=rm-2ze28qp55mrm34g8bbo.mysql.rds.aliyuncs.com
PROD_DB_PORT=3306
PROD_DB_USER=sfabus
PROD_DB_PASSWORD=Wxl@325Pa91
PROD_DB_NAME=market_bi
# Logging configuration
LOG_LEVEL=INFO
LOG_FILE=./logs/app.log
# Excel download configuration
EXCEL_DOWNLOAD_TIMEOUT=30
MAX_EXCEL_SIZE=52428800 # 50MB
# Task timeout configuration
TASK_TIMEOUT_SECONDS=3600 # 1 hour
# ========== Python ==========
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
# Virtual environments
venv/
.venv/
env/
ENV/
# ========== Tests & coverage ==========
.pytest_cache/
.coverage
htmlcov/
.tox/
.nox/
coverage.xml
*.cover
.hypothesis/
# ========== IDEs / editors ==========
.idea/
.vscode/
*.swp
*.swo
*~
.project
.pydevproject
.settings/
# ========== OS files ==========
.DS_Store
.DS_Store?
Thumbs.db
ehthumbs.db
Desktop.ini
# ========== Logs & temp files ==========
*.log
*.tmp
*.temp
.cache/
# ========== Misc ==========
*.sql.backup
*.bak
# Data Cleaning System - Project Documentation
## Project Overview
This project is a data cleaning system built on the FastAPI framework. It extracts data from Excel files, runs cleaning and validation, and saves the final results to a MySQL database.
### Core Features
1. **Excel parsing**: download and parse Excel files from a URL
2. **Data cleaning**: validate, clean, and deduplicate the parsed data
3. **Progress feedback**: report cleaning progress to the front end in near real time via HTTP polling
4. **Persistence**: save the cleaned data to a MySQL database
---
## Project Structure
```
clean_data/
├── index.py                # main entry point
├── requirements.txt        # project dependencies
├── .env.example            # example environment configuration
├── README.md               # project documentation
├── core/                   # core business modules
│   ├── __init__.py
│   ├── excel_handler.py    # Excel file handling
│   ├── data_cleaner.py     # data cleaning logic
│   ├── db_handler.py       # database access
│   └── progress_manager.py # progress tracking
└── utils/                  # utility modules
    ├── __init__.py
    ├── exceptions.py       # custom exceptions
    └── validators.py       # data validation
```
---
## Quick Start
### 1. Environment Setup
```bash
# Clone the project (if needed)
cd clean_data
# Create a virtual environment (recommended)
python -m venv venv
# Activate it
# Windows:
venv\Scripts\activate
# Linux/Mac:
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
```
### 2. Configure Environment Variables
```bash
# Copy the example configuration
cp .env.example .env
# Edit .env and fill in the actual settings
# In particular:
# - DB_HOST, DB_PORT, DB_USER, DB_PASSWORD must point at a real database
# - DB_NAME is the database to use
```
### 3. Start the Service
```bash
# Option 1: run with Python directly
python index.py
# Option 2: run with Uvicorn (recommended)
uvicorn index:app --host 0.0.0.0 --port 8000 --reload
# The service starts at http://0.0.0.0:8000
# API docs: http://localhost:8000/docs (Swagger UI)
```
---
## API Reference
### 1. Start a Cleaning Task
**Request**
```
POST /api/v1/clean
```
**Request Body**
```json
{
  "excel_url": "https://example.com/data.xlsx",
  "department": "sales",
  "description": "Q1 sales data cleaning"
}
```
**Response**
```json
{
  "task_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "queued",
  "message": "Task created; processing...",
  "data_preview": null
}
```
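As a reference, here is a minimal client-side sketch of building this request body in Python. The endpoint URL and field names are taken from the examples above; `build_clean_request` is a hypothetical helper, and the commented-out `requests.post` call assumes the service is running locally.

```python
import json

def build_clean_request(excel_url: str, department: str, description: str = "") -> dict:
    """Build the JSON body expected by POST /api/v1/clean (field names from the docs above)."""
    return {
        "excel_url": excel_url,
        "department": department,
        "description": description,
    }

payload = build_clean_request(
    "https://example.com/data.xlsx", "sales", "Q1 sales data cleaning"
)
body = json.dumps(payload, ensure_ascii=False)
# To actually submit the task (requests is part of the tech stack):
#   resp = requests.post("http://localhost:8000/api/v1/clean", json=payload)
#   task_id = resp.json()["task_id"]
```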
### 2. Get Cleaning Progress
**Request**
```
GET /api/v1/progress/{task_id}
```
**Response**
```json
{
  "task_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "processing",
  "progress": 65,
  "message": "Cleaned 650/1000 rows",
  "timestamp": "2026-03-06T10:30:45.123456"
}
```
**Status Values**
- `queued`: task created, waiting in queue
- `processing`: data is being processed
- `completed`: cleaning finished
- `failed`: an error occurred during cleaning
### 3. Get Cleaning Result
**Request**
```
GET /api/v1/result/{task_id}
```
**Response**
```json
{
  "task_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "ready_to_save",
  "message": "Cleaning complete; ready to save",
  "data_preview": [
    {"产品": "产品A", "金额": 1000},
    {"产品": "产品B", "金额": 2000}
  ],
  "total_rows": 1000,
  "department": "sales"
}
```
### 4. Save Cleaned Data
**Request**
```
POST /api/v1/save
```
**Request Body**
```json
{
  "task_id": "550e8400-e29b-41d4-a716-446655440000",
  "table_name": "sales_data"
}
```
**Response**
```json
{
  "task_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "saved",
  "message": "Data saved to the database successfully",
  "affected_rows": 1000
}
```
### 5. Health Check
**Request**
```
GET /api/v1/health
```
**Response**
```json
{
  "status": "healthy",
  "timestamp": "2026-03-06T10:30:45.123456",
  "service": "数据清洗系统"
}
```
---
## Progress Reporting
### HTTP Polling (no WebSocket needed)
The system reports progress via **HTTP polling**, which has these advantages:
1. **No persistent connections**: clients pull on demand, reducing server load
2. **Broad compatibility**: works with any HTTP client
3. **Easy to deploy**: no WebSocket infrastructure required
4. **Easy to scale**: deploys cleanly to any cloud environment
### Suggested Front-End Implementation
```javascript
// Example: React/Vue front-end logic
const pollProgress = async (taskId) => {
  const interval = setInterval(async () => {
    try {
      const response = await fetch(`/api/v1/progress/${taskId}`);
      const data = await response.json();
      // Update the progress bar
      updateProgressBar(data.progress);
      updateMessage(data.message);
      // Stop polling once the task finishes
      if (data.status === 'completed' || data.status === 'failed') {
        clearInterval(interval);
      }
    } catch (error) {
      console.error('Failed to fetch progress:', error);
    }
  }, 1000); // poll once per second
};
```
---
## Data Cleaning Logic
### Cleaning Steps
1. **Download**: fetch the Excel file from the given URL
2. **Parse**: read the Excel content with openpyxl
3. **Validate**: check data types and required fields
4. **Clean**
   - strip leading/trailing whitespace
   - handle empty values
   - remove duplicates
5. **Cache**: keep the cleaned data in memory
6. **Save**: persist to the database once the front end confirms
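Steps 3 and 4 can be sketched as a small, dependency-free row cleaner. This is a simplified stand-in for the logic in `core/data_cleaner.py`, not the shipped implementation; the required-field names here are illustrative defaults.

```python
def clean_rows(rows, required_fields=("产品", "金额")):
    """Strip whitespace, normalize empties to None, drop invalid rows, deduplicate."""
    cleaned, seen = [], set()
    for row in rows:
        new_row = {}
        for key, value in row.items():
            if isinstance(value, str):
                value = value.strip() or None  # empty strings become None
            new_row[key] = value
        # Drop rows missing a required field
        if any(new_row.get(f) is None for f in required_fields):
            continue
        # Deduplicate on the full row content
        fingerprint = tuple(sorted((k, str(v)) for k, v in new_row.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(new_row)
    return cleaned

rows = [
    {"产品": " 产品A ", "金额": 1000},
    {"产品": "产品A", "金额": 1000},  # duplicate after stripping
    {"产品": "", "金额": 500},        # missing required field
]
result = clean_rows(rows)
```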
### Custom Cleaning Rules
Edit the `_validate_required_fields` method in `core/data_cleaner.py` to customize required-field rules per department:
```python
required_fields_map = {
    'sales': ['产品', '金额', '销售日期'],
    'inventory': ['SKU', '数量', '仓库'],
    'finance': ['交易日期', '金额', '类别']
}
```
---
## Database Configuration
### MySQL 5.6+ Connection Settings
Edit the `.env` file:
```ini
DB_HOST=localhost
DB_PORT=3306
DB_USER=root
DB_PASSWORD=your_password
DB_NAME=clean_data
```
### Create the Target Table (example)
```sql
CREATE TABLE sales_data (
  id INT AUTO_INCREMENT PRIMARY KEY,
  产品 VARCHAR(100),
  金额 DECIMAL(10, 2),
  销售日期 DATE,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```
---
## Exception Handling
The system defines several custom exceptions to make errors easy to trace:
- **DataCleaningException**: raised during data cleaning
- **DatabaseException**: raised on database errors
- **ExcelParsingException**: raised on Excel parsing errors
- **ValidationException**: raised on validation failures
All exceptions are written to the log for troubleshooting.
---
## Logging
The system logs all operations with Python's standard logging module; the log level is configured in `.env`:
```
LOG_LEVEL=INFO
LOG_FILE=./logs/app.log
```
Logged events include:
- task creation and completion
- data processing progress
- errors and exceptions
- database operations
---
## Performance Tips
1. **Batch inserts**: database writes use batched inserts (1000 rows per batch by default)
2. **Async processing**: FastAPI background tasks keep responses non-blocking
3. **Progress caching**: progress and cleaning results are cached in an in-memory dict
4. **Connection pooling**: consider a database connection pool (possible extension)
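The batching in tip 1 amounts to splitting the rows into fixed-size chunks before each database write. A simplified sketch (`chunked` is a hypothetical helper; `BATCH_SIZE` mirrors the 1000-row default mentioned above):

```python
BATCH_SIZE = 1000

def chunked(rows, size=BATCH_SIZE):
    """Yield successive fixed-size batches from a list of rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

# Example: 2500 rows split into batches of 1000, 1000, 500;
# each batch would then be written with cursor.executemany(insert_sql, batch)
batches = list(chunked(list(range(2500))))
sizes = [len(b) for b in batches]
```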
---
## FAQ
### Q: Why not use WebSocket?
A: HTTP polling offers these advantages:
- the server keeps no connection state
- easier horizontal scaling
- no WebSocket libraries or infrastructure needed
- plain HTTP, so compatibility is broad
### Q: Where is the cleaned data stored?
A: Cleaned data is stored:
- **short term**: in server memory (keyed by task_id)
- **long term**: in MySQL once the user confirms the save
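The short-term store can be pictured as a plain dict keyed by task_id. This is a sketch of the idea only; the real `progress_manager` module may add locking, timeouts, and eviction, and the function names here are assumptions.

```python
import uuid
from datetime import datetime

TASKS: dict[str, dict] = {}  # task_id -> task state

def create_task() -> str:
    """Register a new task and return its id."""
    task_id = str(uuid.uuid4())
    TASKS[task_id] = {
        "status": "queued",
        "progress": 0,
        "message": "",
        "result": None,
        "created_at": datetime.now().isoformat(),
    }
    return task_id

def update_task(task_id: str, **fields) -> None:
    """Merge new fields into an existing task's state."""
    TASKS[task_id].update(fields)

tid = create_task()
update_task(tid, status="processing", progress=65, message="已清洗 650/1000 行数据")
```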
### Q: How are large files handled?
A: Set a maximum file size in `.env`:
```
MAX_EXCEL_SIZE=52428800 # 50MB
```
---
## Optional Extensions
1. **Data backup**: periodically back up saved data
2. **Audit log**: record every data modification
3. **Access control**: add authentication and authorization
4. **Cache upgrade**: replace the in-memory cache with Redis
5. **Task queue**: offload large batches to Celery
---
## Deployment
### Production
1. Run the app with Gunicorn + Uvicorn
2. Put a reverse proxy (nginx) in front
3. Enable HTTPS
4. Persist logs
5. Set up monitoring and alerting
### Docker Deployment
```dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["uvicorn", "index:app", "--host", "0.0.0.0", "--port", "8000"]
```
---
## Tech Stack
- **Web framework**: FastAPI 0.104.1
- **ASGI server**: Uvicorn 0.24.0
- **Excel handling**: openpyxl 3.1.0
- **Database driver**: mysql-connector-python 8.2.0
- **Data validation**: Pydantic 2.5.0
- **HTTP client**: requests 2.31.0
---
## License
MIT
---
## Support
For questions or suggestions, contact the development team.
# HTTP Routes and Submodules
"""Normalize exceptions and validation failures into a { code, data, msg } response body."""
from typing import Any

from fastapi import Request
from fastapi.encoders import jsonable_encoder
from fastapi.exceptions import HTTPException, RequestValidationError
from fastapi.responses import JSONResponse


def _msg_from_detail(detail: str | dict | list | None) -> str:
    if detail is None:
        return "请求失败"
    if isinstance(detail, str):
        return detail
    if isinstance(detail, dict):
        return str(detail.get("error") or detail.get("msg") or detail.get("message") or "请求失败")
    return "参数校验失败"


def _data_from_detail(detail: str | dict | list | None) -> Any:
    if isinstance(detail, (dict, list)):
        return jsonable_encoder(detail)
    return None


async def http_exception_handler(request: Request, exc: HTTPException) -> JSONResponse:
    body = {
        "code": exc.status_code,
        "data": _data_from_detail(exc.detail),
        "msg": _msg_from_detail(exc.detail),
    }
    return JSONResponse(status_code=exc.status_code, content=body)


async def validation_exception_handler(request: Request, exc: RequestValidationError) -> JSONResponse:
    errors = jsonable_encoder(exc.errors())
    return JSONResponse(
        status_code=422,
        content={"code": 422, "data": errors, "msg": "参数校验失败"},
    )
"""Unified API response: code=0 means success, non-zero is a logic/business error code; data is the payload; msg is the description."""
from typing import Any

from pydantic import BaseModel, Field


class ApiEnvelope(BaseModel):
    code: int = Field(..., description="0 成功,非 0 失败")
    data: Any = None
    msg: str = ""

    model_config = {"json_schema_extra": {"example": {"code": 0, "data": {}, "msg": "成功"}}}


def ok(data: Any = None, msg: str = "") -> ApiEnvelope:
    return ApiEnvelope(code=0, data=data, msg=msg)
"""Cleaning HTTP routes: validate input, call the team conversion, and map business errors to HTTP status codes."""
from fastapi import APIRouter, HTTPException

from api.response import ApiEnvelope, ok
from api.schemas import CleanRequestBody
from api.team_conversion_loader import default_team_target_path, run_team_conversion

DEPARTMENT_RISK_AUDIT_CLEAN = "风控稽查数据清洗"

api_router = APIRouter(prefix="/api")


def _audit_date_str_from_body(body: CleanRequestBody) -> str | None:
    if body.year is None or body.month is None or body.day is None:
        return None
    return f"{body.year:04d}{body.month:02d}{body.day:02d}"


def _raise_http_for_failed_result(result: dict) -> None:
    """When the team conversion returns ok=False, pick a status code from the error text."""
    err = result.get("error") or ""
    if "source_url 须为" in err:
        raise HTTPException(status_code=400, detail=result)
    if "从 URL 读取源表失败" in err or err.startswith("读取源表失败"):
        raise HTTPException(status_code=502, detail=result)
    if result.get("message") and "error" not in result:
        return
    raise HTTPException(status_code=500, detail=result)


@api_router.post("/v1/clean", response_model=ApiEnvelope)
def post_clean(body: CleanRequestBody) -> ApiEnvelope:
    dept = (body.department or "").strip()
    if dept != DEPARTMENT_RISK_AUDIT_CLEAN:
        raise HTTPException(
            status_code=400,
            detail={
                "ok": False,
                "error": f"不支持的 department: {dept!r},当前仅支持「{DEPARTMENT_RISK_AUDIT_CLEAN}」",
            },
        )
    team_url = (body.team_url or "").strip()
    team_target = (body.team_target_path or "").strip() or default_team_target_path()
    if not team_url:
        raise HTTPException(
            status_code=400,
            detail={"ok": False, "error": "team_url 不能为空"},
        )
    audit_date_str = _audit_date_str_from_body(body)
    result = run_team_conversion(team_url, team_target, audit_date_str)
    if result.get("ok"):
        return ok(data=result, msg="成功")
    _raise_http_for_failed_result(result)
    return ok(data=result, msg=str(result.get("message") or ""))
"""Request body for the cleaning endpoint."""
from pydantic import BaseModel, Field


class CleanRequestBody(BaseModel):
    department: str = Field(..., description="业务类型,风控稽查数据清洗 走团队转换")
    year: int | None = None
    month: int | None = None
    day: int | None = None
    team_url: str | None = None
    team_target_path: str | None = None  # default: cache/team_<timestamp>.xlsx under the project
    puling_url: str | None = None
    chengyu_url: str | None = None
"""Dynamically load the team conversion script (legacy path / Chinese filename), exposing only the callable entry point and path helpers."""
import importlib.util
from datetime import datetime
from pathlib import Path
from typing import Any, Callable

_CODE_BASE = Path(__file__).resolve().parent.parent
_TEAM_SCRIPT = _CODE_BASE / "py_" / "audit" / "point_sale" / "data_conversion.py"


def _load_run_team_conversion() -> Callable[..., dict[str, Any]]:
    spec = importlib.util.spec_from_file_location("team_data_convert", _TEAM_SCRIPT)
    if spec is None or spec.loader is None:
        raise RuntimeError(f"无法加载团队转换模块: {_TEAM_SCRIPT}")
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    fn = getattr(mod, "run_team_conversion", None)
    if fn is None:
        raise RuntimeError("data_conversion 中缺少 run_team_conversion")
    return fn


run_team_conversion: Callable[..., dict[str, Any]] = _load_run_team_conversion()


def default_team_target_path() -> str:
    """When no path is supplied: cache/team_{timestamp}.xlsx"""
    d = _CODE_BASE / "cache"
    d.mkdir(parents=True, exist_ok=True)
    ts = datetime.now().strftime("%Y%m%d_%H%M%S")
    return str(d / f"team_{ts}.xlsx")
/*
Navicat MySQL Data Transfer
Source Server : t100_production
Source Server Version : 50744
Source Host : rm-2ze28qp55mrm34g8bbo.mysql.rds.aliyuncs.com:3306
Source Database : market_bi
Target Server Type : MYSQL
Target Server Version : 50744
File Encoding : 65001
Date: 2026-03-12 11:37:31
*/
SET FOREIGN_KEY_CHECKS=0;
-- ----------------------------
-- Table structure for bi_price_xx
-- ----------------------------
DROP TABLE IF EXISTS `bi_price_xx`;
CREATE TABLE `bi_price_xx` (
  `id` varchar(255) COLLATE utf8mb4_bin DEFAULT NULL COMMENT '主键',
  `bi_product` varchar(255) COLLATE utf8mb4_bin DEFAULT NULL COMMENT '产品系统',
  `prd_name` varchar(255) COLLATE utf8mb4_bin DEFAULT NULL COMMENT '口味',
  `pro_weight` varchar(255) COLLATE utf8mb4_bin DEFAULT NULL COMMENT '产品克重',
  `channel_type` varchar(255) COLLATE utf8mb4_bin DEFAULT NULL COMMENT '渠道',
  `creator` varchar(255) COLLATE utf8mb4_bin DEFAULT NULL COMMENT '提交人',
  `modifier` varchar(255) COLLATE utf8mb4_bin DEFAULT NULL COMMENT '修改人',
  `creator_nickname` varchar(255) COLLATE utf8mb4_bin DEFAULT NULL COMMENT '提交人昵称',
  `modifier_nickname` varchar(255) COLLATE utf8mb4_bin DEFAULT NULL COMMENT '修改人昵称',
  `create_time` datetime DEFAULT NULL COMMENT '提交时间',
  `modify_time` datetime DEFAULT NULL COMMENT '修改时间',
  `qbi_system_upload_id` bigint(30) DEFAULT NULL COMMENT '上传批次主键',
  `low_price` decimal(30,2) DEFAULT NULL COMMENT '低价',
  `normal_price` decimal(30,2) DEFAULT NULL COMMENT '零售价'
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin COMMENT='线下价盘表';
-- ----------------------------
-- Records of bi_price_xx
-- ----------------------------
INSERT INTO `bi_price_xx` VALUES ('7ac70b27-59d8-413c-81e4-d11d3e753ca2', '虎皮凤爪', '全品味', '105g', 'KA', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '14.90', '21.89');
INSERT INTO `bi_price_xx` VALUES ('3ebea3f4-4e40-4088-aacb-d5d7b3194d82', '虎皮凤爪', '全品味', '210g', 'KA', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '26.50', '38.39');
INSERT INTO `bi_price_xx` VALUES ('344df49c-597d-4826-b659-2fd1638edc45', '虎皮凤爪', '全品味', '散称', 'KA', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '45.80', '54.78');
INSERT INTO `bi_price_xx` VALUES ('21f14713-eb58-4090-8df0-aa8a5b98f3e8', '去骨凤爪', '全品味', '72g', 'KA', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '14.90', '21.89');
INSERT INTO `bi_price_xx` VALUES ('7d222a14-8191-4f88-9fec-2fa0e21c27cc', '去骨凤爪', '全品味', '138g', 'KA', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '26.50', '38.39');
INSERT INTO `bi_price_xx` VALUES ('3d2ced09-1e88-46ef-87a3-e2368e5aaff1', '脆笋去骨', '全品味', '散称', 'KA', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '45.80', '50.38');
INSERT INTO `bi_price_xx` VALUES ('bcc8e320-0db0-4535-9fd0-3e79585881c8', '老卤凤爪', '全品味', '95g', 'KA', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '14.90', '21.89');
INSERT INTO `bi_price_xx` VALUES ('f795e572-7b8a-41f9-b2a4-678803019224', '老卤鸭掌', '全品味', '95g', 'KA', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '14.90', '21.89');
INSERT INTO `bi_price_xx` VALUES ('de4bcae9-6ace-43b1-b88f-7c70ca674b64', '鸡肉豆堡', '全品味', '120g', 'KA', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '9.90', '14.19');
INSERT INTO `bi_price_xx` VALUES ('4949ad7f-8f90-4bf9-8848-e862b210e3e0', '虎皮小鸡腿', '全品味', '80g', 'KA', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '12.80', '18.48');
INSERT INTO `bi_price_xx` VALUES ('4286df2e-dcf4-4dd5-a25e-c26d8a3c8829', '虎皮凤爪', '全品味', '105g', 'BC', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '14.90', '21.89');
INSERT INTO `bi_price_xx` VALUES ('e8b863b9-ce10-4a33-b142-6019c62aee93', '虎皮凤爪', '全品味', '210g', 'BC', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '26.50', '38.39');
INSERT INTO `bi_price_xx` VALUES ('07ddcf94-619e-468a-9ad7-67b1835c5898', '虎皮凤爪', '全品味', '68g', 'BC', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '9.80', '14.19');
INSERT INTO `bi_price_xx` VALUES ('0a193d95-e8ec-4e12-abe0-5dc465b1e1bd', '虎皮凤爪', '全品味', '25g', 'BC', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '3.90', '5.39');
INSERT INTO `bi_price_xx` VALUES ('347aa860-eadf-4ae0-aa31-7a64f15322b3', '虎皮凤爪', '全品味', '散称', 'BC', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '45.80', '54.78');
INSERT INTO `bi_price_xx` VALUES ('24e265c7-d526-4b30-a981-2eae56ac2227', '去骨凤爪', '全品味', '72g', 'BC', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '14.90', '21.89');
INSERT INTO `bi_price_xx` VALUES ('71a70334-e06a-43a0-913e-e81d1eafcf94', '老卤凤爪', '全品味', '95g', 'BC', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '14.90', '21.89');
INSERT INTO `bi_price_xx` VALUES ('b9686a81-9771-4c4b-8c4f-faf0b5fdf71f', '老卤鸭掌', '全品味', '95g', 'BC', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '14.90', '21.89');
INSERT INTO `bi_price_xx` VALUES ('350cc768-1407-40ec-b9c4-4a336cd89118', '鸡肉豆堡', '全品味', '120g', 'BC', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '9.90', '14.19');
INSERT INTO `bi_price_xx` VALUES ('5c67fa66-799b-43bc-8263-b84291ec7c44', '鸡肉豆堡', '全品味', '散称', 'BC', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '34.90', '38.39');
INSERT INTO `bi_price_xx` VALUES ('09084915-3a6a-4339-a8b1-f26ca803aec6', '虎皮小鸡腿', '全品味', '80g', 'BC', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '12.80', '18.48');
INSERT INTO `bi_price_xx` VALUES ('54b65a1e-41b2-4d86-9895-64eb03f33ed0', '虎皮凤爪', '全品味', '105g', 'CVS', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '14.90', '21.89');
INSERT INTO `bi_price_xx` VALUES ('44b67fcf-42ac-4b93-83e0-da427b013737', '虎皮凤爪', '全品味', '210g', 'CVS', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '29.50', '38.39');
INSERT INTO `bi_price_xx` VALUES ('d5dfc92a-a075-4277-938b-269f103aa76f', '虎皮凤爪', '全品味', '68g', 'CVS', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '10.90', '14.19');
INSERT INTO `bi_price_xx` VALUES ('ce9500ff-6671-4514-8a40-6a84c79ab1a3', '虎皮凤爪', '全品味', '25g', 'CVS', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '3.90', '5.39');
INSERT INTO `bi_price_xx` VALUES ('0a34f1bc-7e5e-4cf3-a059-da9a29d0bad8', '去骨凤爪', '全品味', '72g', 'CVS', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '14.90', '21.89');
INSERT INTO `bi_price_xx` VALUES ('0bf423ef-5275-43ce-80b4-dac3662b21c2', '去骨凤爪', '全品味', '138g', 'CVS', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '29.50', '38.39');
INSERT INTO `bi_price_xx` VALUES ('0ebf0fe1-25cd-4b1c-bb16-698d714a90f6', '老卤凤爪', '全品味', '95g', 'CVS', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '14.90', '21.89');
INSERT INTO `bi_price_xx` VALUES ('3456b8c9-772f-4c8b-a2a1-a2562273839f', '老卤鸭掌', '全品味', '95g', 'CVS', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '14.90', '21.89');
INSERT INTO `bi_price_xx` VALUES ('55957f41-d285-4dac-b661-b1c6f7faf4c1', '鸡肉豆堡', '全品味', '120g', 'CVS', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '9.90', '14.19');
INSERT INTO `bi_price_xx` VALUES ('44274c3b-697b-401b-9670-b663a235cbf9', '虎皮小鸡腿', '全品味', '80g', 'CVS', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '12.80', '18.48');
INSERT INTO `bi_price_xx` VALUES ('fb961ee5-3d1c-406b-81f0-68f73a24e82a', '虎皮凤爪', '全品味', '68g', '零食', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '8.80', '14.19');
INSERT INTO `bi_price_xx` VALUES ('e400d8e8-19ce-41dd-b02e-391933d76d43', '虎皮凤爪', '全品味', '散称', '零食', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '45.80', '54.78');
INSERT INTO `bi_price_xx` VALUES ('edacdcb9-6450-4965-870c-979f5aa0fdf3', '脆笋去骨', '全品味', '散称', '零食', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '45.80', '50.38');
INSERT INTO `bi_price_xx` VALUES ('d75c4276-56e6-4296-b389-8ea2ff386d48', '鸡肉豆堡', '全品味', '散称', '零食', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '34.90', '38.39');
INSERT INTO `bi_price_xx` VALUES ('b748e150-82ad-447d-89f7-7a5b4efc5b25', '虎皮小鸡腿', '全品味', '散称', '零食', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '35.80', '39.38');
INSERT INTO `bi_price_xx` VALUES ('a8854d47-e1a3-4673-afb5-9b7d0fa8d183', '虎皮凤爪', '全品味', '105g', '批发', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '10.40', '9.90');
INSERT INTO `bi_price_xx` VALUES ('b93e7fef-8ecf-4434-9441-da61b839069b', '虎皮凤爪', '全品味', '210g', '批发', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '19.11', '18.20');
INSERT INTO `bi_price_xx` VALUES ('cc8f1359-2afc-464c-8f0f-62e985e99fb7', '虎皮凤爪', '全品味', '68g', '批发', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '6.83', '6.50');
INSERT INTO `bi_price_xx` VALUES ('5711f296-0451-48c6-bcf7-efdb68d5867b', '虎皮凤爪', '全品味', '25g', '批发', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '2.63', '2.50');
INSERT INTO `bi_price_xx` VALUES ('aab9599c-2a33-437d-8ef8-890ed1bec685', '虎皮凤爪', '全品味', '散称', '批发', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '38.85', '37.00');
INSERT INTO `bi_price_xx` VALUES ('008ff458-bc4d-4fa8-9612-1b50fcd86992', '去骨凤爪', '全品味', '72g', '批发', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '10.40', '9.90');
INSERT INTO `bi_price_xx` VALUES ('d49be613-998a-4fd9-b629-ba01fc51876b', '去骨凤爪', '全品味', '138g', '批发', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '17.33', '16.50');
INSERT INTO `bi_price_xx` VALUES ('8a2eab0f-f4f7-4192-9608-f2b6a7e17482', '老卤凤爪', '全品味', '95g', '批发', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '10.40', '9.90');
INSERT INTO `bi_price_xx` VALUES ('069d4f5e-deb0-43fb-af81-d812f1f23465', '老卤鸭掌', '全品味', '95g', '批发', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '10.40', '9.90');
INSERT INTO `bi_price_xx` VALUES ('e2a057f4-1890-4096-bb7b-80214b86d94d', '鸡肉豆堡', '全品味', '120g', '批发', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '6.83', '6.50');
INSERT INTO `bi_price_xx` VALUES ('b07e83c2-7be9-49e2-b42e-12267010ed0e', '虎皮小鸡腿', '全品味', '80g', '批发', '86f47c35e2d4477d838e1280a949028b', '86f47c35e2d4477d838e1280a949028b', '王璐璐', '王璐璐', '2026-03-12 11:31:37', '2026-03-12 11:31:37', '377410', '8.40', '8.00');
"""
Configuration module.
Reads and manages application settings.
The ENV environment variable (development|production) selects the environment automatically.
"""
import os
from typing import Optional

from dotenv import load_dotenv

# Load the .env file by absolute path so the working directory does not matter
_env_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), '.env')
load_dotenv(dotenv_path=_env_path)

# Environment flag: development | production; defaults to development when unset
_ENV = os.getenv("ENV", "development").strip().lower()
IS_PRODUCTION = _ENV == "production"
IS_DEV = not IS_PRODUCTION


def _db_var(key: str, dev_default: str, prod_default: str = "") -> str:
    """Read a database variable per environment: production prefers PROD_DB_*, otherwise DB_*."""
    if IS_PRODUCTION:
        return os.getenv(f"PROD_DB_{key}", os.getenv(f"DB_{key}", prod_default)) or prod_default
    return os.getenv(f"DB_{key}", dev_default)


class Config:
    """Application configuration."""
    # Environment
    ENV: str = _ENV
    IS_PRODUCTION: bool = IS_PRODUCTION
    IS_DEV: bool = IS_DEV
    # Server (DEBUG defaults off in production)
    HOST: str = os.getenv("HOST", "0.0.0.0")
    PORT: int = int(os.getenv("PORT", "8000"))
    DEBUG: bool = os.getenv("DEBUG", "false" if IS_PRODUCTION else "true").lower() == "true"
    # Database: DB_* in development, PROD_DB_* in production (overridable via system env)
    DB_HOST: str = _db_var("HOST", "localhost")
    DB_PORT: int = int(_db_var("PORT", "3306"))
    DB_USER: str = _db_var("USER", "root")
    DB_PASSWORD: str = _db_var("PASSWORD", "")
    DB_NAME: str = _db_var("NAME", "clean_data")
    # Logging
    LOG_LEVEL: str = os.getenv("LOG_LEVEL", "INFO")
    LOG_FILE: Optional[str] = os.getenv("LOG_FILE")
    # Excel download
    EXCEL_DOWNLOAD_TIMEOUT: int = int(os.getenv("EXCEL_DOWNLOAD_TIMEOUT", "30"))
    MAX_EXCEL_SIZE: int = int(os.getenv("MAX_EXCEL_SIZE", "52428800"))  # 50MB
    # Task timeout
    TASK_TIMEOUT_SECONDS: int = int(os.getenv("TASK_TIMEOUT_SECONDS", "3600"))  # 1 hour

    @classmethod
    def get_db_config(cls) -> dict:
        """Return the database configuration as a dict."""
        return {
            'host': cls.DB_HOST,
            'port': cls.DB_PORT,
            'user': cls.DB_USER,
            'password': cls.DB_PASSWORD,
            'database': cls.DB_NAME,
        }


# Global configuration instance
config = Config()
"""Core business modules."""
"""
Data cleaning module.
Implements the cleaning and validation logic.
"""
import logging
import asyncio
from typing import List, Dict, Any, Callable, Optional

import pandas as pd

logger = logging.getLogger(__name__)

# Registry of cleaning strategies per department
# key: department name, value: (transform function, product-group config, audit source name)
_DEPARTMENT_CLEANERS = {}


def _load_department_cleaners():
    """Lazily load the department-specific cleaning modules."""
    global _DEPARTMENT_CLEANERS
    if _DEPARTMENT_CLEANERS:  # already loaded; nothing to do
        return
    try:
        # Tools used by the department cleaners
        from core_py.数据转换_团队 import (
            transform as _team_transform,
            PRODUCT_GROUPS_JC,
        )  # PRODUCT_GROUPS_JC: config data for 风控稽查数据清洗

        _DEPARTMENT_CLEANERS["风控稽查数据清洗"] = (_team_transform, PRODUCT_GROUPS_JC, "稽查团队")
        logger.info("已加载部门清洗模块: 风控稽查数据清洗")
    except ImportError as e:
        logger.warning(f"加载团队清洗模块失败: {e}")
class DataCleaner:
    """Data cleaning class."""

    def __init__(self):
        self.rules = {}

    async def clean(
        self,
        raw_data: List[Dict[str, Any]],
        department: str,
        progress_callback: Optional[Callable[[float, str, Optional[int]], None]] = None,
        audit_date: Optional[str] = None,
    ) -> List[Dict[str, Any]]:
        """
        Clean the data.

        Args:
            raw_data: raw rows (each a dict keyed by column name)
            department: business department name, e.g. "团队"
            progress_callback: callback taking (progress: 0-1, message: str, cleaned_count)
            audit_date: audit date string 'yyyy-mm-dd'; when None each cleaning module
                defaults to the 1st of last month

        Returns:
            List[Dict]: cleaned rows
        """
        try:
            logger.info(f"开始清洗数据,部门: {department},数据行数: {len(raw_data)}")
            # ── Department-specific cleaning route ───────────────────────
            _load_department_cleaners()
            if department in _DEPARTMENT_CLEANERS:
                return await self._clean_by_department(
                    raw_data, department, progress_callback, audit_date=audit_date
                )
            # ─────────────────────────────────────────────────────────────
            total_rows = len(raw_data)
            cleaned_data = []
            for idx, row in enumerate(raw_data):
                try:
                    cleaned_row = await self._validate_and_convert(row, department)
                    if cleaned_row and not self._is_duplicate(cleaned_row, cleaned_data):
                        cleaned_data.append(cleaned_row)
                    if progress_callback and idx % max(1, total_rows // 10) == 0:
                        progress = idx / total_rows if total_rows > 0 else 0
                        progress_callback(progress, f"已清洗 {idx}/{total_rows} 行数据", len(cleaned_data))
                except Exception as e:
                    logger.warning(f"第 {idx + 1} 行数据清洗失败: {str(e)}")
                    continue
            if progress_callback:
                progress_callback(1.0, f"清洗完成,共 {len(cleaned_data)} 行有效数据", len(cleaned_data))
            logger.info(
                f"数据清洗完成,原始行数: {total_rows},清洗后行数: {len(cleaned_data)}"
            )
            return cleaned_data
        except Exception as e:
            logger.error(f"clean 方法执行失败: {str(e)}")
            raise
    async def _clean_by_department(
        self,
        raw_data: List[Dict[str, Any]],
        department: str,
        progress_callback: Optional[Callable[[float, str, Optional[int]], None]] = None,
        audit_date: Optional[str] = None,
    ) -> List[Dict[str, Any]]:
        """
        Clean by calling the department's dedicated transform function.

        raw_data comes from excel_handler (List[Dict] keyed by column name).
        The transform functions access columns by position via iloc, so as long as
        the DataFrame keeps the original Excel column order, iloc indices line up.
        """
        transform_fn, pg, yname = _DEPARTMENT_CLEANERS[department]
        if progress_callback:
            progress_callback(0.1, "正在转换数据格式", None)
        # List[Dict] → DataFrame (preserves column order, so iloc matches Excel positions)
        df = pd.DataFrame(raw_data)
        if progress_callback:
            progress_callback(0.3, f"正在执行 {department} 数据清洗", None)
        # transform is synchronous; run it in a thread to keep the event loop free
        records = await asyncio.to_thread(transform_fn, df, yname, pg, audit_date)
        if progress_callback:
            progress_callback(1.0, f"清洗完成,共 {len(records)} 行有效数据", len(records))
        logger.info(f"[{department}] 专项清洗完成,共 {len(records)} 条记录")
        return records
    async def _validate_and_convert(
        self, row: Dict[str, Any], department: str
    ) -> Optional[Dict[str, Any]]:
        """
        Validate and convert a single row.

        Args:
            row: the data row
            department: business department name

        Returns:
            The converted row, or None if the row is invalid.
        """
        try:
            cleaned_row = {}
            for key, value in row.items():
                if value is None or (isinstance(value, str) and not value.strip()):
                    # Normalize empty values
                    cleaned_row[key] = None
                    continue
                # Clean string values
                if isinstance(value, str):
                    cleaned_row[key] = value.strip()
                else:
                    cleaned_row[key] = value
            # Check required fields (rules vary by department)
            if not self._validate_required_fields(cleaned_row, department):
                return None
            return cleaned_row
        except Exception as e:
            logger.warning(f"_validate_and_convert 失败: {str(e)}")
            return None

    def _validate_required_fields(self, row: Dict[str, Any], department: str) -> bool:
        """
        Validate required fields.

        Args:
            row: the data row
            department: business department

        Returns:
            bool: whether the row passes validation
        """
        # Example: per-department required-field rules
        required_fields_map = {
            "sales": ["产品", "金额"],
            "inventory": ["SKU", "数量"],
            "finance": ["交易日期", "金额"],
        }
        required_fields = required_fields_map.get(department, [])
        # Every required field must be present and non-empty
        for field in required_fields:
            if field not in row or row[field] is None:
                return False
        return True
def _is_duplicate(
self, row: Dict[str, Any], existing_data: List[Dict[str, Any]]
) -> bool:
"""
检查行是否为重复数据
Args:
row: 当前行
existing_data: 已有数据列表
Returns:
bool: 是否为重复
"""
# 简单的重复检查(可扩展为更复杂的逻辑)
for existing_row in existing_data:
if row == existing_row:
return True
return False
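_validate_required_fields 的必填字段校验可以抽成独立的小函数单独测试。下面是一个示意,规则表沿用源码中的示例配置:

```python
from typing import Any, Dict, List

# 必填字段规则(与上文 _validate_required_fields 的示例规则一致)
REQUIRED_FIELDS_MAP: Dict[str, List[str]] = {
    "sales": ["产品", "金额"],
    "inventory": ["SKU", "数量"],
    "finance": ["交易日期", "金额"],
}

def validate_required_fields(row: Dict[str, Any], department: str) -> bool:
    """检查 row 中该部门的必填字段是否存在且非空;未配置规则的部门默认通过。"""
    for field in REQUIRED_FIELDS_MAP.get(department, []):
        if row.get(field) is None:
            return False
    return True
```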
"""
数据库处理模块
负责与 MySQL 数据库的交互
"""
import logging
import mysql.connector
from typing import List, Dict, Any
from contextlib import contextmanager
from config import config
logger = logging.getLogger(__name__)
class DatabaseHandler:
"""数据库处理类"""
def __init__(self):
"""初始化数据库配置"""
self.db_config = {
'host': config.DB_HOST,
'user': config.DB_USER,
'password': config.DB_PASSWORD,
'database': config.DB_NAME,
'port': config.DB_PORT,
'autocommit': False,
'connection_timeout': 10
}
@contextmanager
def _get_connection(self):
"""
获取数据库连接的上下文管理器
Yields:
mysql.connector.MySQLConnection: 数据库连接
Raises:
Exception: 连接失败时抛出异常
"""
connection = None
try:
connection = mysql.connector.connect(**self.db_config)
logger.info("数据库连接成功")
yield connection
except mysql.connector.Error as e:
logger.error(f"数据库连接失败: {str(e)}")
raise
finally:
if connection and connection.is_connected():
connection.close()
logger.info("数据库连接已关闭")
async def insert_data(
self,
table_name: str,
data: List[Dict[str, Any]]
    ) -> tuple[int, int]:
"""
将数据 upsert 到指定的表(首次写入为 INSERT,命中唯一键时覆盖更新)。
MySQL ON DUPLICATE KEY UPDATE 行为说明:
- 新行插入:rowcount += 1
- 已有行被更新:rowcount += 2
- 数据与现有行完全一致(无变化):rowcount += 0
Args:
table_name: 目标表名
data: 数据列表
Returns:
tuple[int, int]: (submitted_rows, raw_affected)
- submitted_rows: 提交处理的总行数(去重后传入的行数,即预估真实入库行数)
- raw_affected: MySQL 累计 rowcount 原始值(insert=+1, update=+2, 无变化=+0)
Raises:
Exception: 插入失败时抛出异常
"""
        if not data:
            logger.warning("插入的数据为空")
            return 0, 0
try:
with self._get_connection() as connection:
cursor = connection.cursor()
# 获取字段名
columns = list(data[0].keys())
column_names = ', '.join([f'`{col}`' for col in columns])
placeholders = ', '.join(['%s'] * len(columns))
# ON DUPLICATE KEY UPDATE:命中唯一键时覆盖所有字段值
update_clause = ', '.join([f'`{col}` = VALUES(`{col}`)' for col in columns])
upsert_sql = f"""
INSERT INTO `{table_name}` ({column_names})
VALUES ({placeholders})
ON DUPLICATE KEY UPDATE {update_clause}
"""
logger.info(f"准备 upsert {len(data)} 行数据到表 {table_name}")
                # 批量 upsert。executemany 只返回累计 rowcount(insert=+1,update=+2,无变化=+0),
                # 无法逐条区分新增与更新;若要精确区分需改为逐条 execute。
                # 权衡性能与精度,这里保留 executemany 批量写入,累计并返回原始 raw_affected,
                # 由调用方按上述换算关系解读。
                raw_affected = 0
for batch_start in range(0, len(data), 1000):
batch_end = min(batch_start + 1000, len(data))
batch_data = data[batch_start:batch_end]
values_list = [
tuple(row.get(col) for col in columns)
for row in batch_data
]
cursor.executemany(upsert_sql, values_list)
raw_affected += cursor.rowcount
logger.info(f"已处理 {batch_end} / {len(data)} 行数据")
connection.commit()
                # executemany 无法区分 insert 与 update,因此把传入行数作为「提交处理行数」返回,
                # raw_affected 仅作辅助参考(insert 贡献 1,update 贡献 2,无变化贡献 0)。
                submitted_rows = len(data)
cursor.close()
logger.info(
f"upsert 完成:提交 {submitted_rows} 行,"
f"raw_affected={raw_affected}(insert+1 / update+2 / 无变化+0)"
)
# 返回 (submitted_rows, raw_affected) 元组,由调用方决定展示哪个
return submitted_rows, raw_affected
except mysql.connector.Error as e:
logger.error(f"MySQL 错误: {str(e)}")
raise
except Exception as e:
logger.error(f"insert_data 失败: {str(e)}")
raise
async def test_connection(self) -> bool:
"""
测试数据库连接
Returns:
bool: 连接是否成功
"""
try:
with self._get_connection() as connection:
cursor = connection.cursor()
cursor.execute("SELECT 1")
cursor.fetchone()
cursor.close()
return True
except Exception as e:
logger.error(f"数据库连接测试失败: {str(e)}")
return False
async def create_table_if_not_exists(
self,
table_name: str,
schema: Dict[str, str]
) -> bool:
"""
如果表不存在则创建表
Args:
table_name: 表名
schema: 表架构定义 {列名: 列定义}
Returns:
bool: 是否创建成功或表已存在
"""
try:
with self._get_connection() as connection:
cursor = connection.cursor()
# 检查表是否存在
                cursor.execute(
                    """
                    SELECT TABLE_NAME FROM information_schema.TABLES
                    WHERE TABLE_SCHEMA = %s AND TABLE_NAME = %s
                    """,
                    (self.db_config['database'], table_name)
                )
if cursor.fetchone():
logger.info(f"表 {table_name} 已存在")
cursor.close()
return True
# 创建表
columns_sql = ', '.join([f'`{col}` {definition}' for col, definition in schema.items()])
create_sql = f"""
CREATE TABLE `{table_name}` (
id INT AUTO_INCREMENT PRIMARY KEY,
{columns_sql},
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
"""
cursor.execute(create_sql)
connection.commit()
cursor.close()
logger.info(f"成功创建表 {table_name}")
return True
except Exception as e:
logger.error(f"create_table_if_not_exists 失败: {str(e)}")
raise
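insert_data 中拼接 upsert SQL 的方式可以抽成纯函数单独验证。以下为示意,build_upsert_sql 是演示用的假设函数名:

```python
from typing import List

def build_upsert_sql(table_name: str, columns: List[str]) -> str:
    """按上文 insert_data 的方式拼接 ON DUPLICATE KEY UPDATE 语句(标识符用反引号包裹)。"""
    column_names = ", ".join(f"`{c}`" for c in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    update_clause = ", ".join(f"`{c}` = VALUES(`{c}`)" for c in columns)
    return (
        f"INSERT INTO `{table_name}` ({column_names}) "
        f"VALUES ({placeholders}) "
        f"ON DUPLICATE KEY UPDATE {update_clause}"
    )

sql = build_upsert_sql("risk_audit_visit", ["store_code", "price"])
```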
"""
Excel 文件处理模块
负责从 URL 下载和解析 Excel 文件
"""
import aiohttp
import logging
from openpyxl import load_workbook
from io import BytesIO
from typing import List, Dict, Any
logger = logging.getLogger(__name__)
class ExcelHandler:
"""Excel 文件处理类"""
def __init__(self):
self.timeout = aiohttp.ClientTimeout(total=30)
async def fetch_bytes(self, url: str) -> bytes:
"""
从 URL 下载文件,返回原始字节内容(供调用方自行用 pandas 解析)
Args:
url: 文件的网络链接
Returns:
bytes: 文件的原始二进制内容
"""
try:
logger.info(f"开始从 {url} 下载文件")
async with aiohttp.ClientSession(timeout=self.timeout) as session:
async with session.get(url) as response:
if response.status != 200:
raise Exception(f"下载失败,HTTP 状态码: {response.status}")
content = await response.read()
logger.info(f"下载完成,文件大小: {len(content)} 字节")
return content
except Exception as e:
logger.error(f"fetch_bytes 失败: {str(e)}")
raise
async def fetch_and_parse(self, excel_url: str) -> List[Dict[str, Any]]:
"""
从 URL 下载并解析 Excel 文件
Args:
excel_url: Excel 文件的网络链接
Returns:
List[Dict]: 解析后的数据,每行为一个字典
Raises:
Exception: 下载或解析失败时抛出异常
"""
try:
# 1. 下载文件
logger.info(f"开始从 {excel_url} 下载 Excel 文件")
async with aiohttp.ClientSession(timeout=self.timeout) as session:
async with session.get(excel_url) as response:
if response.status != 200:
raise Exception(f"下载失败,HTTP 状态码: {response.status}")
excel_content = await response.read()
logger.info(f"下载完成,文件大小: {len(excel_content)} 字节")
# 2. 解析 Excel
return self._parse_excel_content(excel_content)
except Exception as e:
logger.error(f"fetch_and_parse 失败: {str(e)}")
raise
def _parse_excel_content(self, excel_content: bytes) -> List[Dict[str, Any]]:
"""
解析 Excel 内容
Args:
excel_content: Excel 文件的二进制内容
Returns:
List[Dict]: 解析后的数据
"""
try:
# 使用 BytesIO 从内存中读取
excel_file = BytesIO(excel_content)
workbook = load_workbook(excel_file)
# 获取第一个工作表
worksheet = workbook.active
if not worksheet:
raise Exception("Excel 文件不包含有效的工作表")
# 获取标题行
headers = []
for cell in worksheet[1]:
headers.append(cell.value)
if not headers or all(h is None for h in headers):
raise Exception("Excel 文件不包含有效的标题行")
# 解析数据行
data = []
for row in worksheet.iter_rows(min_row=2, values_only=False):
row_data = {}
for idx, cell in enumerate(row):
if idx < len(headers):
row_data[headers[idx]] = cell.value
# 跳过空行
if any(v is not None for v in row_data.values()):
data.append(row_data)
logger.info(f"成功解析 Excel,共 {len(data)} 行数据")
return data
except Exception as e:
logger.error(f"_parse_excel_content 失败: {str(e)}")
raise
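_parse_excel_content 的「标题行 + 跳过空行」解析思路,可以用内存中构造的小 Excel 验证一遍(以下仅为示意):

```python
from io import BytesIO
from openpyxl import Workbook, load_workbook

# 在内存中构造一个小 Excel,再按上文 _parse_excel_content 的思路解析
wb = Workbook()
ws = wb.active
ws.append(["产品", "金额"])   # 标题行
ws.append(["凤爪", 9.9])      # 数据行
ws.append([None, None])        # 空行,应被跳过
buf = BytesIO()
wb.save(buf)

worksheet = load_workbook(BytesIO(buf.getvalue())).active
headers = [c.value for c in worksheet[1]]
data = []
for row in worksheet.iter_rows(min_row=2):
    row_data = {headers[i]: cell.value for i, cell in enumerate(row) if i < len(headers)}
    if any(v is not None for v in row_data.values()):
        data.append(row_data)
```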
"""
进度管理模块
负责任务进度的记录和查询
"""
import logging
from typing import Dict, Any, Optional
from datetime import datetime, timedelta
import threading
logger = logging.getLogger(__name__)
class ProgressManager:
"""进度管理类"""
def __init__(self, timeout_seconds: int = 3600):
"""
初始化进度管理器
Args:
timeout_seconds: 任务进度的过期时间(秒),默认 1 小时
"""
self.progress_data: Dict[str, Dict[str, Any]] = {}
self.timeout_seconds = timeout_seconds
self.lock = threading.Lock()
def update_progress(
self,
task_id: str,
status: str,
progress: int,
message: str,
processed_count: Optional[int] = None
) -> None:
"""
更新任务进度
Args:
task_id: 任务唯一标识
status: 状态 (queued, processing, completed, failed)
progress: 进度百分比 (0-100)
message: 进度信息
processed_count: 已处理的数据条数,None 表示暂未统计
"""
with self.lock:
self.progress_data[task_id] = {
'task_id': task_id,
'status': status,
'progress': max(0, min(100, progress)),
'message': message,
'processed_count': processed_count,
'timestamp': datetime.now().isoformat(),
'created_at': datetime.now()
}
logger.debug(f"[{task_id}] 进度更新: {status} {progress}% - {message}")
def get_progress(self, task_id: str) -> Optional[Dict[str, Any]]:
"""
获取任务进度
Args:
task_id: 任务唯一标识
Returns:
Optional[Dict]: 进度信息,若任务不存在或已过期返回 None
"""
with self.lock:
if task_id not in self.progress_data:
return None
data = self.progress_data[task_id]
# 检查是否过期
if datetime.now() - data['created_at'] > timedelta(seconds=self.timeout_seconds):
logger.warning(f"任务 {task_id} 已过期,删除记录")
del self.progress_data[task_id]
return None
# 返回字典副本,移除 created_at(内部字段)
result = {k: v for k, v in data.items() if k != 'created_at'}
return result
def get_all_progress(self) -> Dict[str, Dict[str, Any]]:
"""
获取所有任务的进度信息
Returns:
Dict: 所有任务的进度信息
"""
with self.lock:
# 清理过期任务
expired_tasks = []
for task_id, data in self.progress_data.items():
if datetime.now() - data['created_at'] > timedelta(seconds=self.timeout_seconds):
expired_tasks.append(task_id)
for task_id in expired_tasks:
del self.progress_data[task_id]
logger.info(f"清理过期任务: {task_id}")
# 返回所有有效任务的进度
return {
task_id: {k: v for k, v in data.items() if k != 'created_at'}
for task_id, data in self.progress_data.items()
}
def clear_progress(self, task_id: str) -> None:
"""
清除任务进度记录
Args:
task_id: 任务唯一标识
"""
with self.lock:
if task_id in self.progress_data:
del self.progress_data[task_id]
logger.info(f"清除任务 {task_id} 的进度记录")
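ProgressManager 的「加锁写入 + 读取时惰性过期」模式可以用一个极简版本演示(MiniProgressStore 为演示用的假设类名,仅保留核心字段):

```python
import threading
from datetime import datetime, timedelta
from typing import Any, Dict, Optional

class MiniProgressStore:
    """上文 ProgressManager 的极简示意:线程安全写入 + 读取时惰性过期。"""
    def __init__(self, timeout_seconds: int = 3600):
        self._data: Dict[str, Dict[str, Any]] = {}
        self._timeout = timeout_seconds
        self._lock = threading.Lock()

    def update(self, task_id: str, progress: int, message: str) -> None:
        with self._lock:
            self._data[task_id] = {
                "progress": max(0, min(100, progress)),  # 与源码一致:进度夹在 0-100
                "message": message,
                "created_at": datetime.now(),
            }

    def get(self, task_id: str) -> Optional[Dict[str, Any]]:
        with self._lock:
            item = self._data.get(task_id)
            if item is None:
                return None
            if datetime.now() - item["created_at"] > timedelta(seconds=self._timeout):
                del self._data[task_id]  # 过期即删
                return None
            return {k: v for k, v in item.items() if k != "created_at"}

store = MiniProgressStore(timeout_seconds=60)
store.update("t1", 150, "clamp 到 100")
```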
import sys
import os
import pandas as pd
import mysql.connector
# 兼容直接运行(python core_py/1低价计算.py)和作为模块被 index.py 导入两种场景
if __name__ == "__main__":
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from config import config
def load_price_map_from_db() -> dict:
"""
从 market_bi.bi_price_xx 读取线下价盘数据,
返回匹配字典: { "产品系列|产品克重|渠道(大写)" -> low_price(float) }
"""
conn = mysql.connector.connect(
host=config.DB_HOST,
port=config.DB_PORT,
user=config.DB_USER,
password=config.DB_PASSWORD,
database="market_bi",
charset="utf8mb4",
)
try:
sql = "SELECT bi_product, pro_weight, channel_type, low_price FROM bi_price_xx"
df_p = pd.read_sql(sql, conn)
finally:
conn.close()
def _clean(s):
return "" if pd.isna(s) else str(s).strip().upper()
df_p["match_key"] = (
df_p["bi_product"].apply(_clean) + "|"
+ df_p["pro_weight"].apply(_clean) + "|"
+ df_p["channel_type"].apply(_clean)
)
df_p["low_price"] = pd.to_numeric(df_p["low_price"], errors="coerce")
return df_p.set_index("match_key")["low_price"].to_dict()
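load_price_map_from_db 中「三字段拼接大写 match_key → low_price」的构建规则,可以脱离数据库用一个小 DataFrame 验证(build_price_map 为演示用的假设函数名,数据为虚构样例):

```python
import pandas as pd

def build_price_map(df_p: pd.DataFrame) -> dict:
    """按上文 load_price_map_from_db 的规则,从价盘 DataFrame 构建匹配字典。"""
    def _clean(s):
        return "" if pd.isna(s) else str(s).strip().upper()
    key = (
        df_p["bi_product"].apply(_clean) + "|"
        + df_p["pro_weight"].apply(_clean) + "|"
        + df_p["channel_type"].apply(_clean)
    )
    low = pd.to_numeric(df_p["low_price"], errors="coerce")  # 非数值转 NaN
    return dict(zip(key, low))

df_demo = pd.DataFrame({
    "bi_product": ["虎皮凤爪", " 虎皮凤爪 "],
    "pro_weight": ["210g", "105g"],
    "channel_type": ["ka", "BC"],
    "low_price": ["9.9", "bad"],
})
price_map = build_price_map(df_demo)
```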
def transform(df_y: pd.DataFrame) -> pd.DataFrame:
"""
供 API 调用的低价计算入口。
接收大宽表 DataFrame(STANDARD_COLUMNS 列名),从数据库 market_bi.bi_price_xx
读取价盘基准,计算并回填以下三列后返回:
- 是否低价:低价 / 正常 / None(无法匹配或缺价格)
- 破价价差:低价时的价差(decimal),正常/无法匹配时为 None
- 低价整改状态:低价时置为 '未整改',其余不改动
Args:
df_y: 大宽表 DataFrame,必须包含列:
产品系列、产品克重、渠道类型(稽查源提供)、产品价格
Returns:
pd.DataFrame: 更新了低价相关字段的 DataFrame(不修改原对象)
"""
df = df_y.copy()
price_map = load_price_map_from_db()
def _clean(s):
return "" if pd.isna(s) else str(s).strip().upper()
# 构建匹配键和数值价格(辅助列,最终会删除)
df["_series_c"] = df["产品系列"].apply(_clean)
df["_weight_c"] = df["产品克重"].apply(_clean)
df["_channel_c"] = df["渠道类型(稽查源提供)"].apply(_clean)
df["_match_key"] = df["_series_c"] + "|" + df["_weight_c"] + "|" + df["_channel_c"]
df["_price_num"] = pd.to_numeric(df["产品价格"], errors="coerce")
df["_p_low_price"] = df["_match_key"].map(price_map)
# 重置低价相关列
df["是否低价"] = None
df["破价价差"] = None
# 条件向量化计算,避免逐行循环
has_both = df["_price_num"].notna() & df["_p_low_price"].notna()
cond_low = has_both & (df["_price_num"] < df["_p_low_price"])
cond_normal = has_both & ~cond_low
df.loc[cond_low, "是否低价"] = "低价"
df.loc[cond_low, "破价价差"] = (
df.loc[cond_low, "_p_low_price"] - df.loc[cond_low, "_price_num"]
).round(2)
df["低价整改状态"] = df["低价整改状态"].astype(object)
df.loc[cond_low, "低价整改状态"] = "未整改"
df.loc[cond_normal, "是否低价"] = "正常"
df.loc[cond_normal, "破价价差"] = None
# 清除辅助列
df.drop(
columns=["_series_c", "_weight_c", "_channel_c", "_match_key", "_price_num", "_p_low_price"],
inplace=True,
)
return df
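transform 中的向量化低价判断,核心只有几行条件掩码。下面用一个三行的小样例演示同样的口径(列名为演示而简化):

```python
import pandas as pd

# 小样例:复现上文 transform 的向量化低价判断(仅核心列)
df = pd.DataFrame({"产品价格": ["8.5", "12", "abc"], "_低价线": [10.0, 10.0, 10.0]})
price = pd.to_numeric(df["产品价格"], errors="coerce")   # 非数值转 NaN
low = df["_低价线"]
has_both = price.notna() & low.notna()
cond_low = has_both & (price < low)
cond_normal = has_both & ~cond_low

df["是否低价"] = None
df.loc[cond_low, "是否低价"] = "低价"
df.loc[cond_normal, "是否低价"] = "正常"
df["破价价差"] = None
df.loc[cond_low, "破价价差"] = (low - price)[cond_low].round(2)
```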
if __name__ == "__main__":
# ── 独立测试模式:读本地 Excel 大宽表 → 计算低价 → 输出结果文件 ──
from datetime import datetime
from dateutil.relativedelta import relativedelta
current_date = (datetime.now().replace(day=1) - relativedelta(months=1)).strftime("%Y-%m-01")
y_file = f"/王小卤/风控/代码-新/大日期{current_date}_2.xlsx"
    output_file = "/王小卤/风控/代码-新/低价大日期_2.xlsx"
print("正在读取稽查结果大宽表...")
df_y = pd.read_excel(y_file, sheet_name="合并后", dtype=str)
df_y.columns = df_y.columns.str.strip()
print("正在从数据库读取价盘并计算低价...")
df_result = transform(df_y)
df_result.to_excel(output_file, index=False)
print(f"✅ 处理完成!结果已保存至:{output_file}")
import pandas as pd
import copy
import os
from datetime import datetime
from dateutil.relativedelta import relativedelta
# === 本地独立运行配置(仅 __main__ 模式使用)===
source_file = "/王小卤/风控/代码-新//2026.2-团队数据源.xlsx"
def _get_default_audit_date() -> str:
"""返回上月1号作为默认稽查日期,格式 yyyy-mm-01"""
return (datetime.now().replace(day=1) - relativedelta(months=1)).strftime("%Y-%m-01")
# 列映射(目标表列名)
COLUMN_MAPPING = {
"稽查日期": "稽查日期",
"稽查来源": "稽查来源",
"勤策门店编码": "勤策门店编码",
"勤策门店名称": "勤策门店名称",
"经销商名称": "经销商名称",
"城市": "城市",
"渠道类型": "渠道类型(稽查源提供)",
"产品系列": "产品系列",
"产品口味": "产品口味",
"产品克重": "产品克重",
"产品价格": "产品价格",
"产品生产月份": "产品生产月份",
}
# ===== 新增:多产品组配置 =====
# 每组:价格列 + 7个口味列 + 产品信息
# 团队表
PRODUCT_GROUPS_JC = [
# 第1组:虎皮凤爪 210g
{
"price_col": 50,
"flavor_cols": [51, 52, 53, 54, 55, 56, 57],
"series": "虎皮凤爪",
"weight": "210g",
"flavors": ["卤香", "香辣", "椒麻", "火锅", "微辣", "麻辣", "黑鸭"]
},
# 第2组:虎皮凤爪 105g
{
"price_col": 58,
"flavor_cols": [59, 60, 61, 62, 63, 64, 65],
"series": "虎皮凤爪",
"weight": "105g",
"flavors": ["卤香", "香辣", "椒麻", "火锅", "微辣", "麻辣", "黑鸭"]
},
# 第3组:虎皮凤爪 68g
{
"price_col": 66,
"flavor_cols": [67, 68, 69, 70, 71],
"series": "虎皮凤爪",
"weight": "68g",
"flavors": ["卤香", "香辣", "椒麻", "麻辣", "黑鸭"]
},
# 第4组:鸡肉豆堡 120g
{
"price_col": 72,
"flavor_cols": [73, 74],
"series": "鸡肉豆堡",
"weight": "120g",
"flavors": ["卤香", "香辣"]
},
# 第5组:牛肉豆堡 120g
{
"price_col": 75,
"flavor_cols": [76, 77],
"series": "牛肉豆堡",
"weight": "120g",
"flavors": ["卤香", "香辣"]
},
# 第6组:去骨凤爪 72g
{
"price_col": 78,
"flavor_cols": [79, 80],
"series": "去骨凤爪",
"weight": "72g",
"flavors": ["柠檬", "香辣"]
},
# 第7组:去骨凤爪 138g
{
"price_col": 81,
"flavor_cols": [82, 83],
"series": "去骨凤爪",
"weight": "138g",
"flavors": ["柠檬", "香辣"]
},
# 第8组:虎皮小鸡腿 80g
{
"price_col": 84,
"flavor_cols": [85, 86],
"series": "虎皮小鸡腿",
"weight": "80g",
"flavors": ["卤香", "香辣"]
},
# 第9组:老卤凤爪 95g(与老卤鸭掌共用 price_col=87)
{
"price_col": 87,
"flavor_cols": [88],
"series": "老卤凤爪",
"weight": "95g",
"flavors": ["卤香"]
},
# 第10组:老卤鸭掌 95g(与老卤凤爪共用 price_col=87)
{
"price_col": 87,
"flavor_cols": [89],
"series": "老卤鸭掌",
"weight": "95g",
"flavors": ["卤香"]
},
# 第11组:虎皮凤爪 25g
{
"price_col": 90,
"flavor_cols": [91, 92],
"series": "虎皮凤爪",
"weight": "25g",
"flavors": ["卤香", "香辣"]
},
# 第12组:虎皮凤爪 散称
{
"price_col": 93,
"flavor_cols": [94, 95, 96],
"series": "虎皮凤爪",
"weight": "散称",
"flavors": ["卤香", "香辣", "黑鸭"]
}
]
# 标准输出列定义(与目标表结构保持一致)
STANDARD_COLUMNS = [
"稽查日期", "稽查来源", "大区", "战区", "经销商编码", "经销商名称",
"勤策门店编码", "勤策门店名称", "客户经理工号", "客户经理",
"勤策渠道大类", "稽核渠道(对N列清洗)", "城市", "渠道类型(稽查源提供)",
"产品系列", "产品口味", "产品克重", "产品价格", "是否低价", "破价价差", "低价整改状态",
"低价整改说明", "产品生产月份", "临期月份数", "临期状态", "新鲜度",
"大日期整改状态", "大日期整改说明"
]
def _build_records(df_source, yname, pg, existing_columns, audit_date: str = None):
"""
核心记录构建逻辑,供 transform() 和 main() 复用。
Args:
df_source: pandas DataFrame,列通过 iloc 按位置访问
yname: 稽查来源名称,如 '稽查团队'
pg: 产品组配置列表
existing_columns: 目标表的列名列表
audit_date: 稽查日期字符串,格式 'yyyy-mm-dd';为 None 时取上月1号
Returns:
list: 构建好的记录列表(每条为 dict)
"""
if audit_date is None:
audit_date = _get_default_audit_date()
records = []
for idx, row in df_source.iterrows():
base_data = {
"勤策门店编码": str(row.iloc[8]).strip() if pd.notna(row.iloc[8]) else "",
"城市": str(row.iloc[4]).strip() if pd.notna(row.iloc[4]) else "",
"勤策门店名称": str(row.iloc[9]).strip() if pd.notna(row.iloc[9]) else "",
"经销商名称": str(row.iloc[7]).strip() if pd.notna(row.iloc[7]) else "",
"渠道类型": str(row.iloc[10]).strip() if pd.notna(row.iloc[10]) else "",
}
base_row = {}
if COLUMN_MAPPING["稽查日期"] in existing_columns:
base_row[COLUMN_MAPPING["稽查日期"]] = audit_date
if COLUMN_MAPPING["稽查来源"] in existing_columns:
base_row[COLUMN_MAPPING["稽查来源"]] = yname
if COLUMN_MAPPING["勤策门店编码"] in existing_columns:
base_row[COLUMN_MAPPING["勤策门店编码"]] = base_data["勤策门店编码"]
if COLUMN_MAPPING["勤策门店名称"] in existing_columns:
base_row[COLUMN_MAPPING["勤策门店名称"]] = base_data["勤策门店名称"]
if COLUMN_MAPPING["经销商名称"] in existing_columns:
base_row[COLUMN_MAPPING["经销商名称"]] = base_data["经销商名称"]
if COLUMN_MAPPING["城市"] in existing_columns:
base_row[COLUMN_MAPPING["城市"]] = base_data["城市"]
if COLUMN_MAPPING["渠道类型"] in existing_columns:
base_row[COLUMN_MAPPING["渠道类型"]] = base_data["渠道类型"]
for group in pg:
price_col = group["price_col"]
flavor_cols = group["flavor_cols"]
flavors = group["flavors"]
series = group["series"]
weight = group["weight"]
src_price = str(row.iloc[price_col]).strip() if pd.notna(row.iloc[price_col]) else ""
if not src_price or src_price == '无价签':
src_price = ''
row_with_price = copy.deepcopy(base_row)
if COLUMN_MAPPING["产品价格"] in existing_columns:
row_with_price[COLUMN_MAPPING["产品价格"]] = src_price
for i, col_idx in enumerate(flavor_cols):
flavor_name = flavors[i]
src_month = str(row.iloc[col_idx]).strip() if pd.notna(row.iloc[col_idx]) else ""
if src_month:
new_rec = copy.deepcopy(row_with_price)
src_month = normalize_month(src_month)
_set_product_fields(new_rec, series, flavor_name, weight, src_month, existing_columns)
rDate(new_rec)
records.append(new_rec)
elif src_price:
new_rec = copy.deepcopy(row_with_price)
_set_product_fields(new_rec, series, flavor_name, weight, None, existing_columns)
rDate(new_rec)
records.append(new_rec)
return records
def transform(df_source, yname, pg, audit_date: str = None):
"""
供 API 调用的数据转换入口:接收 DataFrame,返回清洗后的记录列表,不读写任何文件。
Args:
df_source: pandas DataFrame,列通过 iloc 按位置访问(与原始 Excel 列顺序对应)
yname: 稽查来源名称,如 '稽查团队'
pg: 产品组配置列表
audit_date: 稽查日期字符串,格式 'yyyy-mm-dd';为 None 时自动取上月1号
Returns:
list[dict]: 按 STANDARD_COLUMNS 结构整理好的记录列表
"""
return _build_records(df_source, yname, pg, STANDARD_COLUMNS, audit_date=audit_date)
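_build_records 中「一行门店数据按产品组展开成多条记录」的逻辑,可以用单组配置的最小样例演示(列索引与数据均为演示用的假设值,不对应真实 Excel 列):

```python
import pandas as pd

# 单组「价格列 + 口味列」展开的极简示意:
# 有生产月份的口味各生成一条记录;无生产月份但有价格的口味也生成一条(月份为 None)
group = {"price_col": 1, "flavor_cols": [2, 3],
         "series": "虎皮凤爪", "weight": "210g", "flavors": ["卤香", "香辣"]}
row = pd.Series(["门店A", "9.9", "2025-12-01", None])

records = []
price = str(row.iloc[group["price_col"]]).strip() if pd.notna(row.iloc[group["price_col"]]) else ""
for i, col_idx in enumerate(group["flavor_cols"]):
    month = str(row.iloc[col_idx]).strip() if pd.notna(row.iloc[col_idx]) else ""
    if month or price:
        records.append({
            "产品系列": group["series"],
            "产品口味": group["flavors"][i],
            "产品克重": group["weight"],
            "产品价格": price,
            "产品生产月份": month or None,
        })
```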
# === 主逻辑(独立运行/本地文件模式) ===
def main(df_source, yname, pg, audit_date: str = None):
if audit_date is None:
audit_date = _get_default_audit_date()
target_file = f"/王小卤/风控/代码-新/大日期{audit_date}_2.xlsx"
try:
# 获取目标表结构
try:
df_target = pd.read_excel(target_file, sheet_name="合并后", dtype=str)
existing_columns = df_target.columns.tolist()
except (FileNotFoundError, ValueError):
df_target = pd.DataFrame(columns=STANDARD_COLUMNS)
existing_columns = STANDARD_COLUMNS
records = _build_records(df_source, yname, pg, existing_columns, audit_date=audit_date)
if not records:
print("⚠️ 无有效数据需要追加。")
return
df_new = pd.DataFrame(records, columns=existing_columns)
df_combined = pd.concat([df_target, df_new], ignore_index=True)
if os.path.exists(target_file):
with pd.ExcelWriter(target_file, engine='openpyxl', mode='a', if_sheet_exists='replace') as writer:
df_combined.to_excel(writer, sheet_name="合并后", index=False)
else:
with pd.ExcelWriter(target_file, engine='openpyxl', mode='w') as writer:
df_combined.to_excel(writer, sheet_name="合并后", index=False)
print(f"✅ 成功追加 {len(records)} 条记录到目标表!")
except Exception as e:
print(f"❌ 错误: {e}")
import traceback
traceback.print_exc()
def _set_product_fields(record, series, flavor, weight, prod_month_str, existing_columns):
"""设置产品字段"""
if COLUMN_MAPPING["产品系列"] in existing_columns:
record[COLUMN_MAPPING["产品系列"]] = series
if COLUMN_MAPPING["产品口味"] in existing_columns:
record[COLUMN_MAPPING["产品口味"]] = flavor
if COLUMN_MAPPING["产品克重"] in existing_columns:
record[COLUMN_MAPPING["产品克重"]] = weight
if prod_month_str and COLUMN_MAPPING["产品生产月份"] in existing_columns:
try:
dt = datetime.strptime(prod_month_str, "%Y-%m-%d")
record[COLUMN_MAPPING["产品生产月份"]] = dt.strftime("%Y-%m-%d")
except (ValueError, TypeError):
record[COLUMN_MAPPING["产品生产月份"]] = None
def rDate(row_dict):
    """根据产品生产月份与稽查日期,计算临期状态、新鲜度与临期月份数"""
prod_date_str = row_dict.get("产品生产月份", None)
inspect_date_str = row_dict.get("稽查日期", "").strip()
if not prod_date_str or not inspect_date_str:
row_dict["临期状态"] = ""
row_dict["新鲜度"] = ""
row_dict["临期月份数"] = ""
return
try:
prod_date = datetime.strptime(prod_date_str, "%Y-%m-%d")
inspect_date = datetime.strptime(inspect_date_str, "%Y-%m-%d")
except ValueError:
row_dict["临期状态"] = ""
row_dict["新鲜度"] = ""
row_dict["临期月份数"] = ""
return
product_series = row_dict.get("产品系列", "")
zg_status = "未整改"
if product_series == "去骨凤爪":
expiry_date = prod_date + relativedelta(months=6)
gap_months = _calculate_gap_months(expiry_date, inspect_date)
if gap_months >= 2:
            status, freshness, zg_status = "非大日期", "高", ""
elif 1 <= gap_months < 2:
status, freshness = "大日期", "低"
elif 0 <= gap_months < 1:
status, freshness = "临期", "低"
else:
status, freshness = "过期", "低"
else:
expiry_date = prod_date + relativedelta(months=9)
gap_months = _calculate_gap_months(expiry_date, inspect_date)
if gap_months >= 3:
            status, freshness, zg_status = "非大日期", "高", ""
elif 1 <= gap_months < 3:
status, freshness = "大日期", "低"
elif 0 <= gap_months < 1:
status, freshness = "临期", "低"
else:
status, freshness = "过期", "低"
row_dict["临期状态"] = status
row_dict["新鲜度"] = freshness
row_dict["临期月份数"] = round(gap_months, 2)
row_dict["大日期整改状态"] = zg_status
def _calculate_gap_months(expiry_date, inspect_date):
diff_years = expiry_date.year - inspect_date.year
diff_months = expiry_date.month - inspect_date.month
diff_days = expiry_date.day - inspect_date.day
return diff_years * 12 + diff_months + diff_days / 30.0
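_calculate_gap_months 的口径(年月差 + 天差/30)可以用独立片段快速验证:

```python
from datetime import datetime

def calculate_gap_months(expiry_date: datetime, inspect_date: datetime) -> float:
    """与上文 _calculate_gap_months 相同的口径:年差*12 + 月差 + 天差/30。"""
    return (
        (expiry_date.year - inspect_date.year) * 12
        + (expiry_date.month - inspect_date.month)
        + (expiry_date.day - inspect_date.day) / 30.0
    )

gap = calculate_gap_months(datetime(2026, 5, 1), datetime(2026, 2, 1))
```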
import re
# 这里还需要修改
def normalize_month(src_month):
    """
    将生产月份字符串标准化为 'yyyy-mm-01' 格式。
    支持的输入格式:
    - 'yyyy-mm'(如 '2025-12' 或 '2025-1')→ '2025-12-01'
    - 'yyyymm'(如 '202512')→ '2025-12-01'
    其他格式或无效值原样返回
    """
    if not isinstance(src_month, str):
        return src_month  # 非字符串直接返回
    src_month = src_month.strip()
    if not src_month:
        return src_month
    # 情况1: yyyy-mm 格式,月份补零为两位(如 2025-1 → 2025-01)
    if re.fullmatch(r'\d{4}-\d{1,2}', src_month):
        year, month = src_month.split('-')
        return f"{year}-{month.zfill(2)}-01"
    # 情况2: yyyymm 格式(6位数字,如 202512),直接取后两位作为月份
    if re.fullmatch(r'\d{6}', src_month):
        return f"{src_month[:4]}-{src_month[4:]}-01"
    # 其他格式:不处理,原样返回
    return src_month
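normalize_month 的预期行为可以用下面的独立片段验证(为便于单独运行,这里内联了与上文相同的实现):

```python
import re

def normalize_month(src_month):
    """与上文相同的标准化规则:'yyyy-mm' / 'yyyymm' → 'yyyy-mm-01',其余原样返回。"""
    if not isinstance(src_month, str):
        return src_month
    src_month = src_month.strip()
    if re.fullmatch(r"\d{4}-\d{1,2}", src_month):
        year, month = src_month.split("-")
        return f"{year}-{month.zfill(2)}-01"
    if re.fullmatch(r"\d{6}", src_month):
        return f"{src_month[:4]}-{src_month[4:]}-01"
    return src_month
```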
if __name__ == "__main__":
# TODO: 配置 sheet 页名称
print("正在读取【团队】源文件(跳过第 1 行标题,第 2 行作为数据第 1 行)...")
    # 1. skiprows=1 : 跳过物理第 1 行(标题)
    # 2. header=None : 不把物理第 2 行当表头,而是当作数据;
    #    这样物理第 2 行成为 df 的第 0 行,列名自动变为 0, 1, 2...,
    #    与 _build_records 中 row.iloc[4]、row.iloc[8] 等按位置访问的逻辑一致。
df_source_p = pd.read_excel(source_file, skiprows=1, header=None, dtype=str)
# 验证读取结果(可选,用于调试)
print(f"✅ 成功读取 {len(df_source_p)} 行数据。")
if len(df_source_p) > 0:
print("前 2 行数据预览(确认第 2 行是否在列):")
print(df_source_p.head(2))
print(f"列索引范围:0 到 {len(df_source_p.columns) - 1}")
main(df_source_p, '稽查团队', PRODUCT_GROUPS_JC)
from fastapi import FastAPI
from fastapi.exceptions import HTTPException, RequestValidationError

from api.exception_handlers import http_exception_handler, validation_exception_handler
from api.routes_clean import api_router

app = FastAPI(title="Clean Data API")
app.add_exception_handler(HTTPException, http_exception_handler)
app.add_exception_handler(RequestValidationError, validation_exception_handler)
app.include_router(api_router)
"""
数据清洗系统 - FastAPI 应用主程序
Description: 提供 Excel 数据解析、清洗和存储的 API 服务
"""
from fastapi import FastAPI, BackgroundTasks
from pydantic import BaseModel
import logging
import uuid
import asyncio
import math
import random
import pandas as pd
from io import BytesIO
from datetime import datetime
from typing import Optional, Dict, Any
# 导入业务模块
from core.excel_handler import ExcelHandler
from core.data_cleaner import DataCleaner
from core.db_handler import DatabaseHandler
from core.progress_manager import ProgressManager
from utils.exceptions import DataCleaningException, DatabaseException
from utils.validators import validate_excel_url
from utils.response import BizCode, ok_resp, fail_resp
# 风控稽查大宽表:中文列名 → 数据库英文字段名
FENGKONG_COLUMN_MAP = {
"稽查日期": "audit_date",
"稽查来源": "source",
"大区": "region_name",
"战区": "district_name",
"经销商编码": "dealer_code",
"经销商名称": "dealer_name",
"勤策门店编码": "store_code",
"勤策门店名称": "store_name",
"客户经理工号": "f_emp_no",
"客户经理": "f_emp_name",
"勤策渠道大类": "qin_ce_type_large",
"稽核渠道(对N列清洗)": "jh_channel_type",
"城市": "city",
    "渠道类型(稽查源提供)": "channel_type",
"产品系列": "series",
"产品口味": "taste",
"产品克重": "weight",
"产品价格": "price",
"是否低价": "low_price",
"破价价差": "low_price_diff",
"低价整改状态": "low_price_status",
"低价整改说明": "low_price_rectify",
"产品生产月份": "production_month",
"临期月份数": "near_month_num",
"临期状态": "near_month_status",
"新鲜度": "fresh_status",
"大日期整改状态": "large_date_status",
"大日期整改说明": "large_date_rectify",
}
# risk_audit_visit 各字段类型分组(用于入库前类型强制转换)
_FK_DECIMAL_COLS = {"price", "low_price_diff"}
_FK_INT_COLS = {"near_month_num"}
_FK_DATE_COLS = {"audit_date", "production_month"}
# varchar 字段最大长度限制(超长截断,防止 Data too long 报错)
_FK_VARCHAR_MAX = {
"source": 20, "region_name": 20, "district_name": 20,
"dealer_code": 10, "dealer_name": 100,
"store_code": 20, "store_name": 100,
"f_emp_no": 20, "f_emp_name": 100,
"qin_ce_type_large": 20, "jh_channel_type": 20,
"city": 30, "channel_type": 30,
"series": 20, "taste": 20, "weight": 20,
"low_price": 20, "low_price_status": 20, "low_price_rectify": 100,
"near_month_status": 20, "fresh_status": 20,
"large_date_status": 20, "large_date_rectify": 100,
}
def _coerce_fengkong_row(row: dict) -> dict:
"""
对已完成列名映射(英文 key)的行做类型强制转换,使其与 risk_audit_visit 字段类型完全匹配:
- decimal: 转 float,失败 → None
- int: 转 int,失败 → None
- date: 保留 'YYYY-MM-DD' 前10位,格式非法 → None
- varchar: 转字符串并按最大长度截断,空值 → None
"""
result = {}
for col, val in row.items():
# 统一空值处理
if val is None or (isinstance(val, str) and val.strip() == ''):
result[col] = None
continue
if col in _FK_DECIMAL_COLS:
try:
result[col] = float(val)
except (ValueError, TypeError):
result[col] = None
elif col in _FK_INT_COLS:
try:
result[col] = int(float(val))
except (ValueError, TypeError):
result[col] = None
elif col in _FK_DATE_COLS:
s = str(val)[:10]
try:
datetime.strptime(s, "%Y-%m-%d")
result[col] = s
except ValueError:
result[col] = None
else:
# varchar:转字符串,按最大长度截断
s = str(val).strip()
max_len = _FK_VARCHAR_MAX.get(col)
result[col] = s[:max_len] if max_len else s
return result
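_coerce_fengkong_row 的分类型转换规则可以用一个精简版演示(字段分组与长度限制仅取源码中的子集,coerce_row 为演示用的假设函数名):

```python
from datetime import datetime

DECIMAL_COLS = {"price", "low_price_diff"}
DATE_COLS = {"audit_date", "production_month"}
VARCHAR_MAX = {"series": 20}

def coerce_row(row: dict) -> dict:
    """上文 _coerce_fengkong_row 的精简示意:decimal/date/varchar 三类转换。"""
    out = {}
    for col, val in row.items():
        if val is None or (isinstance(val, str) and val.strip() == ""):
            out[col] = None                      # 统一空值处理
        elif col in DECIMAL_COLS:
            try:
                out[col] = float(val)
            except (ValueError, TypeError):
                out[col] = None
        elif col in DATE_COLS:
            s = str(val)[:10]                    # 保留 'YYYY-MM-DD' 前 10 位
            try:
                datetime.strptime(s, "%Y-%m-%d")
                out[col] = s
            except ValueError:
                out[col] = None
        else:
            s = str(val).strip()                 # varchar:转字符串并按最大长度截断
            max_len = VARCHAR_MAX.get(col)
            out[col] = s[:max_len] if max_len else s
    return out

row = coerce_row({"price": "9.90", "audit_date": "2026-02-01 00:00:00",
                  "series": "  虎皮凤爪  ", "taste": ""})
```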
def _sanitize_nan(records: list) -> list:
"""将列表中每行 dict 里的 float NaN / Inf 以及空字符串替换为 None,确保数据库写入兼容。"""
sanitized = []
for row in records:
sanitized.append({
k: (None if (isinstance(v, float) and (math.isnan(v) or math.isinf(v)))
or (isinstance(v, str) and v.strip() == '')
else v)
for k, v in row.items()
})
return sanitized
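_sanitize_nan 的替换规则可以独立验证如下(为便于单独运行,这里内联了同样的逻辑):

```python
import math

def sanitize_nan(records: list) -> list:
    """与上文 _sanitize_nan 相同:float NaN/Inf 和空字符串统一替换为 None。"""
    return [
        {
            k: (None
                if (isinstance(v, float) and (math.isnan(v) or math.isinf(v)))
                or (isinstance(v, str) and v.strip() == "")
                else v)
            for k, v in row.items()
        }
        for row in records
    ]

rows = sanitize_nan([{"a": float("nan"), "b": " ", "c": 1.5}])
```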
# 配置日志
logging.basicConfig(
level=logging.INFO, # 只记录 INFO 以上的日志
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s' # 时间 - 模块名 - 级别 - 内容
)
logger = logging.getLogger(__name__) # __name__ 运行时获取模块名
# 创建 FastAPI 应用
app = FastAPI(
    title="数据清洗系统",
    description="用于数据解析、清洗和持久化的 API 服务",
    version="1.0.0"
)
# ==================== 请求数据模型 ====================
class CleaningRequest(BaseModel):
"""数据清洗请求模型"""
excel_url: Optional[str] = None # 普通清洗模式必填;风控稽查模式可不传
department: str
description: Optional[str] = None
audit_date: Optional[str] = None # 稽查日期,格式 'yyyy-mm-dd',不传则取上月1号
# ── 风控稽查数据清洗 专用字段 ──────────────────────────────────
year: Optional[int] = None # 数据所属年
month: Optional[int] = None # 数据所属月
day: Optional[int] = None # 数据所属日
team_url: Optional[str] = None # 团队数据表链接
puling_url: Optional[str] = None # 浦零数据表链接
chengyu_url: Optional[str] = None # 诚予数据表链接
class SavingRequest(BaseModel):
"""数据保存请求模型"""
task_id: str
table_name: Optional[str] = None # 风控稽查任务已预设表名,可不传
# ==================== 业务逻辑 ====================
class DataCleaningService:
"""数据清洗服务主类"""
# 性能基准参数(可根据实际情况调整)
DOWNLOAD_TIME_BASE = 2 # 下载和解析基础时间(秒)
DOWNLOAD_TIME_PER_ROW = 0.0001 # 每行数据的下载时间(秒)
CLEANING_TIME_PER_ROW = 0.001 # 每行数据的清洗时间(秒)
VALIDATION_TIME_BASE = 1 # 验证基础时间(秒)
CACHING_TIME_PER_ROW = 0.0001 # 每行数据的缓存时间(秒)
CACHE_TTL_SECONDS = 1800 # cache 保留时长:30 分钟
def __init__(self):
self.progress_manager = ProgressManager()
self.excel_handler = ExcelHandler()
self.data_cleaner = DataCleaner()
self.db_handler = DatabaseHandler()
# 存储已清洗的数据(内存中,可扩展为 Redis)
self.cleaned_data_cache: Dict[str, Any] = {}
# 正在执行保存操作的 task_id 集合,用于防止并发重复写入
self._saving_tasks: set = set()
def _evict_expired_cache(self):
"""清除超过 TTL 的 cache 条目,在写入和读取时调用"""
now = datetime.now()
expired = [
tid for tid, v in self.cleaned_data_cache.items()
if (now - v['created_at']).total_seconds() > self.CACHE_TTL_SECONDS
]
for tid in expired:
del self.cleaned_data_cache[tid]
logger.info(f"[cache] 已清除过期任务 {tid}")
def estimate_completion_time(self, row_count: int) -> int:
"""
根据数据行数预估完成时间
Args:
row_count: Excel 文件的数据行数
Returns:
int: 预估完成时间(秒)
"""
# 计算各阶段时间
download_time = self.DOWNLOAD_TIME_BASE + (row_count * self.DOWNLOAD_TIME_PER_ROW)
validation_time = self.VALIDATION_TIME_BASE
cleaning_time = row_count * self.CLEANING_TIME_PER_ROW
caching_time = row_count * self.CACHING_TIME_PER_ROW
# 总时间(向上取整)
total_time = int(download_time + validation_time + cleaning_time + caching_time)
# 最少 5 秒,最多 3600 秒(1小时)
return max(5, min(total_time, 3600))
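estimate_completion_time 的预估公式(基准参数沿用源码取值)可以独立演示:

```python
# 上文 estimate_completion_time 的公式示意(参数取源码中的基准值)
DOWNLOAD_TIME_BASE = 2
DOWNLOAD_TIME_PER_ROW = 0.0001
CLEANING_TIME_PER_ROW = 0.001
VALIDATION_TIME_BASE = 1
CACHING_TIME_PER_ROW = 0.0001

def estimate_completion_time(row_count: int) -> int:
    total = int(
        DOWNLOAD_TIME_BASE + row_count * DOWNLOAD_TIME_PER_ROW
        + VALIDATION_TIME_BASE
        + row_count * CLEANING_TIME_PER_ROW
        + row_count * CACHING_TIME_PER_ROW
    )
    return max(5, min(total, 3600))  # 下限 5 秒,上限 1 小时

est = estimate_completion_time(100_000)
```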
async def clean_data_from_url(
self,
task_id: str,
excel_url: str,
department: str,
raw_data: list = None,
audit_date: str = None
) -> Dict[str, Any]:
"""
从 URL 下载并清洗 Excel 数据
Args:
task_id: 任务唯一标识
excel_url: Excel 文件的网络链接
department: 业务部门名称
raw_data: 可选,已下载的原始数据(由路由层传入以避免重复下载)
audit_date: 稽查日期字符串,格式 'yyyy-mm-dd'
Returns:
包含清洗结果的字典
"""
try:
# 1. 记录任务开始
self.progress_manager.update_progress(
task_id,
status="processing",
progress=10,
message="开始下载 Excel 文件"
)
logger.info(f"[{task_id}] 开始处理数据清洗任务")
# 2. 下载并解析 Excel(若路由层已下载则直接复用,避免重复请求)
if raw_data is None:
self.progress_manager.update_progress(
task_id,
status="processing",
progress=20,
message="正在解析 Excel 文件"
)
raw_data = await self.excel_handler.fetch_and_parse(excel_url)
logger.info(f"[{task_id}] 成功解析 Excel,数据行数: {len(raw_data)}")
# 3. 数据验证
self.progress_manager.update_progress(
task_id,
status="processing",
progress=30,
message="正在验证数据"
)
if not raw_data:
raise DataCleaningException("解析的 Excel 数据为空")
# 4. 执行数据清洗
self.progress_manager.update_progress(
task_id,
status="processing",
progress=50,
message="正在清洗数据"
)
cleaned_data = await self.data_cleaner.clean(
raw_data,
department,
progress_callback=lambda p, m, count=None: self.progress_manager.update_progress(
task_id,
status="processing",
progress=int(50 + p * 0.4), # 进度从50%到90%
message=m,
processed_count=count
),
audit_date=audit_date
)
logger.info(f"[{task_id}] 数据清洗完成,清洗后数据行数: {len(cleaned_data)}")
# 5. 缓存清洗后的数据(写入前先清除过期条目)
self.progress_manager.update_progress(
task_id,
status="processing",
progress=90,
message="正在缓存清洗后的数据"
)
self._evict_expired_cache()
safe_data = _sanitize_nan(cleaned_data)
self.cleaned_data_cache[task_id] = {
'data': safe_data,
'department': department,
'created_at': datetime.now(),
'row_count': len(safe_data)
}
# 6. 任务完成
self.progress_manager.update_progress(
task_id,
status="completed",
progress=100,
message="数据清洗完成,等待前端确认",
processed_count=len(cleaned_data)
)
return {
'task_id': task_id,
'status': 'completed',
'message': '数据清洗成功',
'data_preview': cleaned_data[:5], # 返回前5行用于预览
'total_rows': len(cleaned_data)
}
except DataCleaningException as e:
logger.error(f"[{task_id}] 数据清洗业务异常: {str(e)}")
self.progress_manager.update_progress(
task_id,
status="failed",
progress=0,
message=f"清洗失败: {str(e)}"
)
raise
except Exception as e:
logger.error(f"[{task_id}] 数据清洗系统异常: {str(e)}", exc_info=True)
self.progress_manager.update_progress(
task_id,
status="failed",
progress=0,
message=f"系统异常: {str(e)}"
)
raise DataCleaningException(f"未知错误: {str(e)}")
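作为补充说明(非项目源码):上面 progress_callback 里的 `int(50 + p * 0.4)` 把清洗器内部的子进度线性映射到总体 50%–90% 区间。这里假设 data_cleaner 以 0–100 的尺度回报 p(与注释「进度从50%到90%」一致),可用一个最小示意验证该映射:

```python
def map_cleaning_progress(p: float) -> int:
    """把 data_cleaner 内部进度 p(假设为 0~100)映射为总体进度(50~90)。"""
    p = min(max(p, 0.0), 100.0)   # 防御性裁剪,越界值收敛到合法区间
    return int(50 + p * 0.4)      # 与上文回调中的表达式一致
```

p=0 对应总体 50%,p=100 对应总体 90%,保证清洗阶段的进度条不会倒退也不会越过缓存阶段的 90%。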
async def save_cleaned_data(
self,
task_id: str,
table_name: str
) -> Dict[str, Any]:
"""
将清洗后的数据保存到数据库
Args:
task_id: 任务唯一标识
table_name: 目标表名
Returns:
包含保存结果的字典
"""
# ── 并发防重:同一 task_id 只允许一个 save 请求在执行 ──────────
# asyncio 是单线程协程模型,此处 check-and-add 之间不会发生协程切换,
# 因此无需额外加锁,天然原子。
if task_id in self._saving_tasks:
raise DatabaseException(f"任务 {task_id} 正在保存中,请勿重复提交")
self._saving_tasks.add(task_id)
try:
logger.info(f"[{task_id}] 开始保存数据到数据库")
# 验证数据是否存在(先清除过期条目)
self._evict_expired_cache()
if task_id not in self.cleaned_data_cache:
raise DatabaseException(f"任务 {task_id} 的清洗数据不存在或已过期(超过30分钟)")
cached = self.cleaned_data_cache[task_id]
cleaned_data = cached['data']
# 优先使用缓存中预设的表名(风控稽查任务已写死 risk_audit_visit)
target_table = cached.get('table_name') or table_name
if not target_table:
raise DatabaseException("未指定目标表名,请在请求中传入 table_name")
# 将中文列名映射为数据库英文字段名,并强制转换各字段类型(仅对 risk_audit_visit 生效)
if target_table == "risk_audit_visit":
cleaned_data = [
_coerce_fengkong_row(
{FENGKONG_COLUMN_MAP[k]: v for k, v in row.items() if k in FENGKONG_COLUMN_MAP}
)
for row in cleaned_data
]
# 保存到数据库
submitted_rows, raw_affected = await self.db_handler.insert_data(
target_table,
cleaned_data
)
logger.info(
f"[{task_id}] 成功保存到 {target_table},"
f"提交行数={submitted_rows},raw_affected={raw_affected}"
)
# 清理缓存
del self.cleaned_data_cache[task_id]
return {
'task_id': task_id,
'status': 'saved',
'message': '数据已成功保存到数据库',
'affected_rows': submitted_rows, # 真实提交(去重后)行数,与预览页 total_rows 一致
}
except DatabaseException as e:
logger.error(f"[{task_id}] 数据库异常: {str(e)}")
raise
except Exception as e:
logger.error(f"[{task_id}] 保存数据时出错: {str(e)}", exc_info=True)
raise DatabaseException(f"保存失败: {str(e)}")
finally:
# 无论成功或失败,都释放保存锁,避免任务永远卡在「保存中」状态
self._saving_tasks.discard(task_id)
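上面「检查是否在集合中 → 加入集合 → finally 中 discard 释放」的防重模式可以抽成一个独立的最小示意(非业务代码,`_running`、`try_acquire`、`release` 均为演示用名字)。关键点是:在 asyncio 单线程事件循环里,检查与添加之间没有 `await`,不会发生协程切换,因此无需加锁:

```python
_running: set[str] = set()   # 正在执行的 task_id 集合(演示用)

def try_acquire(task_id: str) -> bool:
    """同一 task_id 只允许一个执行者;返回是否获得执行权。"""
    if task_id in _running:   # 检查
        return False
    _running.add(task_id)     # 添加(与检查之间无 await,天然原子)
    return True

def release(task_id: str) -> None:
    """释放执行权;discard 幂等,重复释放不抛错。"""
    _running.discard(task_id)
```

与 save_cleaned_data 一样,释放必须放在 finally 中,否则一次异常就会让该 task_id 永远处于「保存中」。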
async def clean_fengkong_data(
self,
task_id: str,
team_url: Optional[str],
puling_url: Optional[str],
chengyu_url: Optional[str],
audit_date: Optional[str],
) -> Dict[str, Any]:
"""
风控稽查数据清洗:分别下载团队、浦零、诚予数据源,各自清洗后合并为一张大宽表,
结果存入内存缓存,不写本地文件。
Args:
task_id: 任务唯一标识
team_url: 团队数据表下载链接(可为 None)
puling_url: 浦零数据表下载链接(可为 None)
chengyu_url: 诚予数据表下载链接(可为 None)
audit_date: 稽查日期,格式 'yyyy-mm-dd';为 None 时各模块自动取上月1号
"""
from core_py.数据转换_团队 import (
transform as team_transform,
PRODUCT_GROUPS_JC,
STANDARD_COLUMNS,
)
from core_py.数据转换_诚予_浦零 import (
transform as pl_cy_transform,
PRODUCT_GROUPS,
PRODUCT_GROUPS_CY,
)
try:
self.progress_manager.update_progress(
task_id, status="processing", progress=5, message="开始风控稽查数据清洗"
)
logger.info(f"[{task_id}] 开始风控稽查数据清洗,audit_date={audit_date}")
all_records = []
progress_step = 0
source_count = sum(1 for u in [team_url, puling_url, chengyu_url] if u)
progress_per_source = int(80 / source_count) if source_count else 80
# ── 1. 团队数据 ──────────────────────────────────────────
if team_url:
progress_step += progress_per_source
self.progress_manager.update_progress(
task_id, status="processing",
progress=max(10, progress_step - progress_per_source + 10),
message="正在下载团队数据表..."
)
raw_bytes = await self.excel_handler.fetch_bytes(team_url)
df_team = await asyncio.to_thread(
pd.read_excel, BytesIO(raw_bytes), skiprows=1, header=None, dtype=str
)
self.progress_manager.update_progress(
task_id, status="processing",
progress=max(10, progress_step - progress_per_source // 2),
message="正在清洗团队数据..."
)
records_team = await asyncio.to_thread(
team_transform, df_team, "稽查团队", PRODUCT_GROUPS_JC, audit_date
)
all_records.extend(records_team)
logger.info(f"[{task_id}] 团队数据清洗完成,{len(records_team)} 条记录")
# ── 2. 浦零数据 ──────────────────────────────────────────
if puling_url:
progress_step += progress_per_source
self.progress_manager.update_progress(
task_id, status="processing",
progress=max(15, progress_step - progress_per_source + 10),
message="正在下载浦零数据表..."
)
raw_bytes = await self.excel_handler.fetch_bytes(puling_url)
df_pl = await asyncio.to_thread(
pd.read_excel, BytesIO(raw_bytes), header=2, dtype=str
)
self.progress_manager.update_progress(
task_id, status="processing",
progress=max(15, progress_step - progress_per_source // 2),
message="正在清洗浦零数据..."
)
records_pl = await asyncio.to_thread(
pl_cy_transform, df_pl, "浦零", PRODUCT_GROUPS, audit_date
)
all_records.extend(records_pl)
logger.info(f"[{task_id}] 浦零数据清洗完成,{len(records_pl)} 条记录")
# ── 3. 诚予数据 ──────────────────────────────────────────
if chengyu_url:
progress_step += progress_per_source
self.progress_manager.update_progress(
task_id, status="processing",
progress=max(20, progress_step - progress_per_source + 10),
message="正在下载诚予数据表..."
)
raw_bytes = await self.excel_handler.fetch_bytes(chengyu_url)
df_cy = await asyncio.to_thread(
pd.read_excel, BytesIO(raw_bytes), header=2, dtype=str
)
self.progress_manager.update_progress(
task_id, status="processing",
progress=max(20, progress_step - progress_per_source // 2),
message="正在清洗诚予数据..."
)
records_cy = await asyncio.to_thread(
pl_cy_transform, df_cy, "诚予", PRODUCT_GROUPS_CY, audit_date
)
all_records.extend(records_cy)
logger.info(f"[{task_id}] 诚予数据清洗完成,{len(records_cy)} 条记录")
# ── 4. 合并为大宽表(内存,不写文件) ──────────────────
self.progress_manager.update_progress(
task_id, status="processing", progress=85, message="正在合并数据宽表..."
)
df_merged = pd.DataFrame(all_records, columns=STANDARD_COLUMNS)
logger.info(f"[{task_id}] 大宽表合并完成,共 {len(df_merged)} 条记录")
# ── 5. 低价计算(从数据库读取价盘,回填低价字段) ────────
self.progress_manager.update_progress(
task_id, status="processing", progress=93, message="正在执行低价计算..."
)
import importlib.util
import pathlib
_lp_spec = importlib.util.spec_from_file_location(
"low_price_calc",
pathlib.Path(__file__).parent / "core_py" / "1低价计算.py",
)
_lp_mod = importlib.util.module_from_spec(_lp_spec)
_lp_spec.loader.exec_module(_lp_mod)
df_final = await asyncio.to_thread(_lp_mod.transform, df_merged)
final_records = _sanitize_nan(
df_final.where(pd.notna(df_final), None).to_dict(orient="records")
)
logger.info(f"[{task_id}] 低价计算完成,共 {len(final_records)} 条记录")
# ── 6. 写入内存缓存 ──────────────────────────────────────
self._evict_expired_cache()
self.cleaned_data_cache[task_id] = {
"data": final_records,
"department": "风控稽查数据清洗",
"created_at": datetime.now(),
"row_count": len(final_records),
"table_name": "risk_audit_visit",
}
self.progress_manager.update_progress(
task_id, status="completed", progress=100,
message=f"风控稽查数据清洗完成,共 {len(final_records)} 条记录,等待前端确认",
processed_count=len(final_records)
)
return {
"task_id": task_id,
"status": "completed",
"message": "风控稽查数据清洗成功",
"data_preview": final_records[:5],
"total_rows": len(final_records),
}
except Exception as e:
logger.error(f"[{task_id}] 风控稽查数据清洗失败: {str(e)}", exc_info=True)
self.progress_manager.update_progress(
task_id, status="failed", progress=0,
message=f"清洗失败: {str(e)}"
)
raise
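上面按「提供了几个数据源」均分 0–80% 进度区间的计算,可以单独示意如下(非业务代码,函数名为演示用;三个 URL 参数任意为真即计一个源):

```python
def progress_per_source(team_url, puling_url, chengyu_url) -> int:
    """与 clean_fengkong_data 中一致:下载/清洗阶段共占 80%,按数据源个数均分。"""
    source_count = sum(1 for u in (team_url, puling_url, chengyu_url) if u)
    return int(80 / source_count) if source_count else 80  # 0 个源时退化为 80,避免除零
```

剩余 20% 留给合并宽表(85%)、低价计算(93%)与写缓存/完成(100%)。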
# ==================== 初始化服务 ====================
service = DataCleaningService()
# ==================== API 路由 ====================
@app.post("/api/v1/clean")
async def start_cleaning(request: CleaningRequest, background_tasks: BackgroundTasks):
"""
启动数据清洗任务
Returns: { code, msg, data: { task_id, status, estimated_completion_time, total_rows } }
"""
try:
task_id = str(uuid.uuid4())
logger.info(f"创建新任务: {task_id}, 部门: {request.department}")
# ── 风控稽查数据清洗 专用分支 ──────────────────────────────
if request.department == "风控稽查数据清洗":
if not any([request.team_url, request.puling_url, request.chengyu_url]):
return fail_resp(BizCode.BAD_REQUEST, "风控稽查数据清洗至少需要提供一个数据源地址(team_url / puling_url / chengyu_url)")
# 从 year/month/day 构造稽查日期,未传则由清洗模块自动取上月1号
audit_date = None
if request.year and request.month and request.day:
audit_date = f"{request.year}-{request.month:02d}-{request.day:02d}"
estimated_rows = 1000
estimated_time = service.estimate_completion_time(estimated_rows)
# 提前写入 queued 进度,避免前端轮询时返回 404
service.progress_manager.update_progress(
task_id, status="queued", progress=0, message="任务已创建,等待处理"
)
background_tasks.add_task(
service.clean_fengkong_data,
task_id,
request.team_url,
request.puling_url,
request.chengyu_url,
audit_date,
)
# ── 普通清洗分支 ───────────────────────────────────────────
else:
if not validate_excel_url(request.excel_url):
return fail_resp(BizCode.BAD_REQUEST, "Excel URL 格式无效")
estimated_rows = 0
estimated_time = 5
prefetched_raw_data = None
try:
prefetched_raw_data = await service.excel_handler.fetch_and_parse(request.excel_url)
estimated_rows = len(prefetched_raw_data)
estimated_time = service.estimate_completion_time(estimated_rows)
logger.info(f"[{task_id}] 预估数据行数: {estimated_rows}, 预估完成时间: {estimated_time}秒")
except Exception as e:
logger.warning(f"[{task_id}] 预读 Excel 失败,后台任务将重新下载: {str(e)}")
estimated_rows = 1000
estimated_time = service.estimate_completion_time(estimated_rows)
background_tasks.add_task(
service.clean_data_from_url,
task_id,
request.excel_url,
request.department,
prefetched_raw_data,
request.audit_date,
)
return ok_resp(
data={
"task_id": task_id,
"status": "queued",
"estimated_completion_time": estimated_time,
"total_rows": estimated_rows,
},
msg="任务已创建,正在处理中..."
)
except Exception as e:
logger.error(f"启动清洗任务失败: {str(e)}")
return fail_resp(BizCode.SERVER_ERROR, f"启动任务失败: {str(e)}", http_status=500)
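路由中由 year/month/day 拼接 audit_date 的补零逻辑可用最小示意验证(`build_audit_date` 为演示用名字;与上文一致,三个字段任一缺失即返回 None,交由清洗模块自动取上月 1 号):

```python
from typing import Optional

def build_audit_date(year: Optional[int], month: Optional[int], day: Optional[int]) -> Optional[str]:
    """三者齐备才拼接 'yyyy-mm-dd',月/日用 :02d 补零;否则返回 None。"""
    if year and month and day:
        return f"{year}-{month:02d}-{day:02d}"
    return None
```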
@app.get("/api/v1/progress/{task_id}")
async def get_progress(task_id: str):
"""
获取数据清洗进度(HTTP 轮询,建议前端每 500ms-1s 调用一次)
Returns: { code, msg, data: { task_id, status, progress, message, timestamp } }
"""
try:
progress_data = service.progress_manager.get_progress(task_id)
if not progress_data:
return fail_resp(BizCode.NOT_FOUND, "任务不存在", http_status=404)
return ok_resp(data=progress_data)
except Exception as e:
logger.error(f"获取进度失败: {str(e)}")
return fail_resp(BizCode.SERVER_ERROR, "获取进度失败", http_status=500)
@app.get("/api/v1/result/{task_id}")
async def get_cleaning_result(task_id: str):
"""
获取清洗结果及数据预览(任务完成后调用)
Returns: { code, msg, data: { task_id, status, data_preview, total_rows, department } }
"""
try:
progress_data = service.progress_manager.get_progress(task_id)
if not progress_data:
return fail_resp(BizCode.NOT_FOUND, "任务不存在", http_status=404)
if progress_data['status'] == 'processing':
return fail_resp(BizCode.TASK_PROCESSING, "任务仍在处理中", http_status=202)
if progress_data['status'] == 'failed':
return fail_resp(BizCode.TASK_FAILED, progress_data['message'])
service._evict_expired_cache()
if task_id not in service.cleaned_data_cache:
return fail_resp(BizCode.NOT_FOUND, "清洗数据不存在或已过期(超过30分钟)", http_status=404)
cached = service.cleaned_data_cache[task_id]
raw_data = cached['data']
# 对 risk_audit_visit 先做列名映射 + 类型转换,再基于唯一键去重,
# 得到真正会写入数据库的行数(用于 total_rows);预览数据保留中文列名
target_table = cached.get('table_name', '')
if target_table == "risk_audit_visit":
mapped = [
_coerce_fengkong_row(
{FENGKONG_COLUMN_MAP[k]: v for k, v in row.items() if k in FENGKONG_COLUMN_MAP}
)
for row in raw_data
]
# 按唯一键去重(保留最后一条,与 ON DUPLICATE KEY UPDATE 行为一致)
_BIZ_KEYS = ("audit_date", "source", "store_name", "channel_type", "series", "taste", "weight")
dedup: dict = {}
for i, row in enumerate(mapped):
key = tuple(row.get(k) for k in _BIZ_KEYS)
dedup[key] = i # 只记录原始行索引,用于去重后从 raw_data 取中文行
total_rows = len(dedup)
# 用去重后的索引对应回 raw_data(中文列名),保证预览列始终为中文
dedup_raw = [raw_data[i] for i in dedup.values()]
else:
dedup_raw = raw_data
total_rows = len(raw_data)
# 随机抽取最多 20 行用于前端预览(中文列名)
sample_rows = random.sample(dedup_raw, min(20, len(dedup_raw)))
return ok_resp(
data={
"task_id": task_id,
"status": "ready_to_save",
"data_preview": sample_rows,
"total_rows": total_rows, # 去重后的预估入库行数
"raw_rows": cached['row_count'], # 清洗前宽表原始行数,供参考
"department": cached['department']
},
msg="数据清洗完成,可进行保存"
)
except Exception as e:
logger.error(f"获取清洗结果失败: {str(e)}")
return fail_resp(BizCode.SERVER_ERROR, "获取结果失败", http_status=500)
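上面「按业务唯一键去重、保留最后一条」的做法与 MySQL `ON DUPLICATE KEY UPDATE` 的最终结果一致,下面是一个脱离 FastAPI 上下文的最小示意(键名取自上文的 `_BIZ_KEYS`;`dedup_keep_last` 为演示用名字):

```python
_BIZ_KEYS = ("audit_date", "source", "store_name", "channel_type", "series", "taste", "weight")

def dedup_keep_last(rows: list) -> list:
    """按唯一键去重;dict 赋值天然「后写覆盖先写」,故保留每个键的最后一条。"""
    picked = {}
    for i, row in enumerate(rows):
        picked[tuple(row.get(k) for k in _BIZ_KEYS)] = i  # 只记索引,保持原行对象不变
    return [rows[i] for i in picked.values()]
```

这也是接口把 `total_rows`(去重后)与 `raw_rows`(去重前)分开返回的原因:两者之差即会被数据库「更新而非插入」的行数。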
@app.post("/api/v1/save")
async def save_cleaned_data(request: SavingRequest):
"""
保存清洗后的数据到 MySQL 数据库(前端确认数据无误后调用)
Returns: { code, msg, data: { task_id, status, affected_rows } }
"""
try:
if not request.task_id:
return fail_resp(BizCode.BAD_REQUEST, "参数不完整:task_id 为必填")
result = await service.save_cleaned_data(request.task_id, request.table_name)
return ok_resp(data=result, msg="数据已成功保存到数据库")
except DatabaseException as e:
logger.error(f"保存数据失败: {str(e)}")
return fail_resp(BizCode.DB_ERROR, str(e), http_status=500)
except Exception as e:
logger.error(f"保存数据时发生错误: {str(e)}")
return fail_resp(BizCode.SERVER_ERROR, f"保存失败: {str(e)}", http_status=500)
@app.get("/api/v1/url-link")
async def get_url_link():
"""
从数据库 fortune-hub.transfer_url 表读取跳转链接
Returns: { code, msg, data: { url_link: str } }
"""
try:
with service.db_handler._get_connection() as conn:
cursor = conn.cursor(dictionary=True)
cursor.execute("SELECT `url_link` FROM `fortune-hub`.`transfer_url` LIMIT 1")
row = cursor.fetchone()
cursor.close()
if not row or not row.get("url_link"):
return fail_resp(BizCode.NOT_FOUND, "未查询到跳转链接数据", http_status=404)
return ok_resp(data={"url_link": row["url_link"]})
except Exception as e:
logger.error(f"获取跳转链接失败: {str(e)}")
return fail_resp(BizCode.DB_ERROR, f"获取跳转链接失败: {str(e)}", http_status=500)
@app.get("/api/v1/health")
async def health_check():
"""健康检查接口"""
return ok_resp(
data={"service": "数据清洗系统", "timestamp": str(datetime.now())},
msg="healthy"
)
@app.get("/")
async def root():
"""根路由 - API 欢迎信息"""
return ok_resp(
data={"version": "1.0.0", "docs": "/docs", "redoc": "/redoc"},
msg="欢迎使用数据清洗系统"
)
# ==================== 异常处理 ====================
@app.exception_handler(DataCleaningException)
async def data_cleaning_exception_handler(request, exc):
"""处理数据清洗异常"""
logger.error(f"DataCleaningException: {str(exc)}")
return fail_resp(BizCode.TASK_FAILED, str(exc), http_status=400)
@app.exception_handler(DatabaseException)
async def database_exception_handler(request, exc):
"""处理数据库异常"""
logger.error(f"DatabaseException: {str(exc)}")
return fail_resp(BizCode.DB_ERROR, str(exc), http_status=500)
# ==================== 应用启动和关闭事件 ====================
@app.on_event("startup")
async def startup_event():
"""应用启动时的初始化"""
logger.info("数据清洗系统启动")
try:
# 初始化数据库连接等
pass
except Exception as e:
logger.error(f"启动时出错: {str(e)}")
@app.on_event("shutdown")
async def shutdown_event():
"""应用关闭时的清理"""
logger.info("数据清洗系统关闭")
try:
# 关闭数据库连接等
pass
except Exception as e:
logger.error(f"关闭时出错: {str(e)}")
# ==================== 主程序入口 ====================
if __name__ == "__main__":
import uvicorn
    # 运行 Uvicorn 服务器;启用 reload 时必须以「导入字符串」方式传入应用,
    # 直接传 app 对象会导致热重载不生效(uvicorn 会给出警告)
    uvicorn.run(
        "main:app",
        host="0.0.0.0",
        port=8000,
        log_level="info",
        reload=True  # 开发环境下启用热重载
    )
# 数据转换_团队.py
# 团队稽查宽表仅从 URL 拉取;转窄表后写入本地目标 xlsx「合并后」sheet,并计算临期/大日期/新鲜度等。
import copy
import os
import sys
import urllib.error
from datetime import datetime
from pathlib import Path

import pandas as pd
from dateutil.relativedelta import relativedelta

# 动态加载本文件时保证能 import code 下的 utils
_CODE_ROOT = Path(__file__).resolve().parent.parent
if str(_CODE_ROOT) not in sys.path:
    sys.path.insert(0, str(_CODE_ROOT))

from utils.dates import (  # noqa: E402
    approx_gap_months_calendar,
    first_yyyy_mm_dd_in_dataframe,
    first_yyyy_mm_dd_in_iloc,
    normalize_year_month_to_day01,
    to_yyyy_mm_dd,
)
from utils.excel_http import read_excel_from_url_skip1_with_header_row  # noqa: E402
def _resolve_audit_date(
audit_date_str: str | None,
df_target: pd.DataFrame,
df_source: pd.DataFrame | None = None,
*,
source_audit_col: int = 0,
) -> tuple[str | None, str | None]:
"""稽查日期:团队宽表指定列 → 显式参数 → 目标表列/第三列。返回 (YYYY-MM-DD, 错误信息)。"""
n = None
if df_source is not None and df_source.shape[1] > source_audit_col:
n = first_yyyy_mm_dd_in_iloc(df_source, source_audit_col)
if n is None and audit_date_str is not None and str(audit_date_str).strip():
n = to_yyyy_mm_dd(audit_date_str)
if n is None:
return None, f"稽查日期参数无法解析: {audit_date_str!r}"
if n is None:
n = first_yyyy_mm_dd_in_dataframe(
df_target,
("稽查日期列", "稽查日期"),
third_column_fallback=True,
)
if n:
return n, None
return None, (
"未获取到稽查日期:请在团队宽表「稽核/稽查日期」列(首行表头须含该字样)逐行填写;"
"或在目标表「合并后」填写;或传入 audit_date_str。"
)
# 列映射(目标表列名)
COLUMN_MAPPING = {
    "稽查日期": "稽查日期",  # 逻辑名 → 目标表列名
    "稽查来源": "稽查来源",
    "勤策门店编码": "勤策门店编码",
    "勤策门店名称": "勤策门店名称",
    # ……(中间映射项在原提交的 diff 中被折叠省略)……
    "产品生产月份": "产品生产月份",
}

# ===== 多产品组配置:每组对应源表一截「价格 + 多口味月份列」=====
PRODUCT_GROUPS_JC = [
    # 第1组:虎皮凤爪 210g
    {
        "price_col": 50,  # 源表列索引:该组价格
        "flavor_cols": [51, 52, 53, 54, 55, 56, 57],  # 各口味生产月份列索引
        "series": "虎皮凤爪",  # 写入目标表的产品系列
        "weight": "210g",  # 克重文案
        "flavors": ["卤香", "香辣", "椒麻", "火锅", "微辣", "麻辣", "黑鸭"]  # 与 flavor_cols 一一对应
    },
    # 第2组:虎皮凤爪 105g
    {
        "price_col": 58,
        "flavor_cols": [59, 60, 61, 62, 63, 64, 65],
        "series": "虎皮凤爪",
        "weight": "105g",
        "flavors": ["卤香", "香辣", "椒麻", "火锅", "微辣", "麻辣", "黑鸭"]
    },
    # 第3组:虎皮凤爪 68g
    {
        "price_col": 66,
        "flavor_cols": [67, 68, 69, 70, 71],
        "series": "虎皮凤爪",
        "weight": "68g",
        "flavors": ["卤香", "香辣", "椒麻", "麻辣", "黑鸭"]
    },
    # 第4组:鸡肉豆堡 120g
    {
        "price_col": 72,
        "flavor_cols": [73, 74],
        "series": "鸡肉豆堡",
        "weight": "120g",
        "flavors": ["卤香", "香辣"]
    },
    # 第5组:牛肉豆堡 120g
    {
        "price_col": 75,
        "flavor_cols": [76, 77],
        "series": "牛肉豆堡",
        "weight": "120g",
        "flavors": ["卤香", "香辣"]
    },
    # 第6组:去骨凤爪 72g
    {
        "price_col": 78,
        "flavor_cols": [79, 80],
        "series": "去骨凤爪",
        "weight": "72g",
        "flavors": ["柠檬", "香辣"]
    },
    # 第7组:去骨凤爪 138g
    {
        "price_col": 81,
        "flavor_cols": [82, 83],
        "series": "去骨凤爪",
        "weight": "138g",
        "flavors": ["柠檬", "香辣"]
    },
    # 第8组:虎皮小鸡腿 80g
    {
        "price_col": 84,
        "flavor_cols": [85, 86],
        "series": "虎皮小鸡腿",
        "weight": "80g",
        "flavors": ["卤香", "香辣"]
    },
    # 第9组:老卤凤爪 95g(与老卤鸭掌共用 price_col=87)
    {
        "price_col": 87,
        "flavor_cols": [88],
        "series": "老卤凤爪",
        "weight": "95g",
        "flavors": ["卤香"]
    },
    # 第10组:老卤鸭掌 95g(与老卤凤爪共用 price_col=87)
    {
        "price_col": 87,
        "flavor_cols": [89],
        "series": "老卤鸭掌",
        "weight": "95g",
        "flavors": ["卤香"]
    },
    # 第11组:虎皮凤爪 25g
    {
        "price_col": 90,
        "flavor_cols": [91, 92],
        "series": "虎皮凤爪",
        "weight": "25g",
        "flavors": ["卤香", "香辣"]
    },
    # 第12组:虎皮凤爪 散称
    {
        "price_col": 93,
        "flavor_cols": [94, 95, 96],
        "series": "虎皮凤爪",
        "weight": "散称",
        "flavors": ["卤香", "香辣", "黑鸭"]
    }
]
# 标准输出列定义(与目标表结构保持一致)
STANDARD_COLUMNS = [
"稽查日期", "稽查来源", "大区", "战区", "经销商编码", "经销商名称",
"勤策门店编码", "勤策门店名称", "客户经理工号", "客户经理",
"勤策渠道大类", "稽核渠道(对N列清洗)", "城市", "渠道类型(稽查源提供)",
"产品系列", "产品口味", "产品克重", "产品价格", "是否低价", "破价价差", "低价整改状态",
"低价整改说明", "产品生产月份", "临期月份数", "临期状态", "新鲜度",
"大日期整改状态", "大日期整改说明"
]
# 首行表头未识别「稽核/稽查日期」时回退的列索引(0 基)
TEAM_WIDE_AUDIT_DATE_COL_FALLBACK = 0


def _find_wide_table_audit_col(header_row: pd.Series) -> int | None:
    """首行表头:列名含「稽核日期」或「稽查日期」的列下标(与 skiprows=1 后数据 iloc 对齐)。"""
    keys = ("稽核日期", "稽查日期")
    for i in range(len(header_row)):
        v = header_row.iloc[i]
        if v is None or (isinstance(v, float) and pd.isna(v)):
            continue
        s = str(v).strip().replace(" ", "")
        if s and any(k in s for k in keys):
            return i
    return None


# === 主逻辑:宽表 → 窄表并写回目标 ===
def main(
    df_source,
    yname,
    pg,
    target_file_path,
    audit_date_str=None,
    *,
    source_audit_col: int = TEAM_WIDE_AUDIT_DATE_COL_FALLBACK,
):
    tf = Path(target_file_path)  # 目标 xlsx 路径对象
    try:
        try:  # 尝试读已有目标工作簿
            df_target = pd.read_excel(tf, sheet_name="合并后", dtype=str)  # 合并后 sheet,全文当字符串
            existing_columns = df_target.columns.tolist()  # 目标列顺序,新行须对齐
        except (FileNotFoundError, ValueError):  # 无文件或无该 sheet
            standard_columns = [  # 新建目标时的标准表头
                "稽查日期", "稽查来源", "大区", "战区", "经销商编码", "经销商名称",
                "勤策门店编码", "勤策门店名称", "客户经理工号", "客户经理",
                "勤策渠道大类", "稽核渠道(对N列清洗)", "城市", "渠道类型(稽查源提供)",
                "产品系列", "产品口味", "产品克重", "产品价格", "是否低价", "破价价差", "低价整改状态",
                "低价整改说明", "产品生产月份", "临期月份数", "临期状态", "新鲜度",
                "大日期整改状态", "大日期整改说明"
            ]
            df_target = pd.DataFrame(columns=standard_columns)  # 空表占位
            existing_columns = standard_columns  # 列名列表同上

        ad, ad_err = _resolve_audit_date(
            audit_date_str,
            df_target,
            df_source,
            source_audit_col=source_audit_col,
        )
        if ad_err:  # 无法得到稽查日期
            print(f"❌ {ad_err}")  # 控制台提示
            return {"ok": False, "error": ad_err}  # 提前返回

        records = []  # 收集本批生成的窄表行(字典)
        src_has_audit_col = df_source.shape[1] > source_audit_col
        for idx, row in df_source.iterrows():  # 遍历源表每一门店行
            base_data = {  # 从源固定列位抽取门店维度(列号与团队源表约定一致)
                "勤策门店编码": str(row.iloc[8]).strip() if pd.notna(row.iloc[8]) else "",  # 第 9 列
                "城市": str(row.iloc[4]).strip() if pd.notna(row.iloc[4]) else "",
                "勤策门店名称": str(row.iloc[9]).strip() if pd.notna(row.iloc[9]) else "",
                "经销商名称": str(row.iloc[7]).strip() if pd.notna(row.iloc[7]) else "",
                "渠道类型": str(row.iloc[10]).strip() if pd.notna(row.iloc[10]) else "",
            }

            base_row = {}  # 目标表一行骨架(仅含目标里有的列)
            if COLUMN_MAPPING["稽查日期"] in existing_columns:  # 目标有稽查日期列
                row_ad = to_yyyy_mm_dd(row.iloc[source_audit_col]) if src_has_audit_col else None
                # 该门店宽表行展开的多条窄表行共用本行稽核日期
                base_row[COLUMN_MAPPING["稽查日期"]] = row_ad or ad
            if COLUMN_MAPPING["稽查来源"] in existing_columns:
                base_row[COLUMN_MAPPING["稽查来源"]] = yname  # 如「稽查团队」
            if COLUMN_MAPPING["勤策门店编码"] in existing_columns:
                base_row[COLUMN_MAPPING["勤策门店编码"]] = base_data["勤策门店编码"]
            if COLUMN_MAPPING["勤策门店名称"] in existing_columns:
                base_row[COLUMN_MAPPING["勤策门店名称"]] = base_data["勤策门店名称"]
            if COLUMN_MAPPING["经销商名称"] in existing_columns:
                base_row[COLUMN_MAPPING["经销商名称"]] = base_data["经销商名称"]
            if COLUMN_MAPPING["城市"] in existing_columns:
                base_row[COLUMN_MAPPING["城市"]] = base_data["城市"]
            if COLUMN_MAPPING["渠道类型"] in existing_columns:
                base_row[COLUMN_MAPPING["渠道类型"]] = base_data["渠道类型"]

            for group in pg:  # 当前门店下每个产品组展开多行
                price_col = group["price_col"]  # 组内价格列索引
                flavor_cols = group["flavor_cols"]  # 组内各口味月份列
                flavors = group["flavors"]  # 口味名称列表
                series = group["series"]  # 系列名
                weight = group["weight"]  # 克重

                src_price = str(row.iloc[price_col]).strip() if pd.notna(row.iloc[price_col]) else ""  # 读价格单元格
                if not src_price or src_price == '无价签':  # 空或占位文案视为无价格
                    src_price = ''  # 统一为空串

                row_with_price = copy.deepcopy(base_row)  # 每组单独拷贝,避免串味
                if COLUMN_MAPPING["产品价格"] in existing_columns:
                    row_with_price[COLUMN_MAPPING["产品价格"]] = src_price  # 写入该组价格

                for i, col_idx in enumerate(flavor_cols):  # 该组每个口味一列
                    flavor_name = flavors[i]  # 与列索引对齐的口味名
                    src_month = str(row.iloc[col_idx]).strip() if pd.notna(row.iloc[col_idx]) else ""  # 生产月份原文

                    if src_month:  # 有月份则必有窄表行(业务规则)
                        new_rec = copy.deepcopy(row_with_price)  # 再拷贝一行给本口味
                        src_month = normalize_year_month_to_day01(src_month)  # 规范为 yyyy-mm-dd 串
                        _set_product_fields(new_rec, series, flavor_name, weight, src_month, existing_columns)  # 填产品列
                        rDate(new_rec)  # 算临期/新鲜度等
                        records.append(new_rec)  # 入结果列表

                    elif src_price:  # 无月份但有价格也要一行,月份空
                        new_rec = copy.deepcopy(row_with_price)
                        _set_product_fields(new_rec, series, flavor_name, weight, None, existing_columns)  # 不设生产月份
                        rDate(new_rec)  # 仍跑一遍(内部会因缺日期清空临期字段)
                        records.append(new_rec)

        if not records:  # 没有任何可写行
            msg = "无有效数据需要追加。"
            print(f"⚠️ {msg}")
            return {"ok": False, "message": msg}

        df_new = pd.DataFrame(records, columns=existing_columns)  # 新行 DataFrame,列顺序与目标一致
        df_combined = pd.concat([df_target, df_new], ignore_index=True)  # 旧数据 + 新数据

        if os.path.exists(tf):  # 目标文件已在磁盘上
            with pd.ExcelWriter(tf, engine='openpyxl', mode='a', if_sheet_exists='replace') as writer:  # 追加模式替换 sheet
                df_combined.to_excel(writer, sheet_name="合并后", index=False)  # 不写行索引
        else:  # 首次创建文件
            with pd.ExcelWriter(tf, engine='openpyxl', mode='w') as writer:  # 新建工作簿
                df_combined.to_excel(writer, sheet_name="合并后", index=False)

        print(f"✅ 成功追加 {len(records)} 条记录到目标表!")
        return {
            "ok": True,
            "records_added": len(records),  # 新增行数
            "target_file": str(tf),  # 绝对/相对路径字符串
        }

    except Exception as e:  # 未预料异常
        print(f"❌ 错误: {e}")
        import traceback  # 延迟导入,仅出错时打印栈
        traceback.print_exc()
        return {"ok": False, "error": str(e)}
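下文 rDate 的核心是「生产日期 + 保质期月数 → 到期日 → 距稽查日的月差」。`approx_gap_months_calendar` 的真实实现在 utils.dates 中、未随本提交展示;下面用纯标准库给出一个假设性近似(整月数 + 余天按 30 天折算),仅用于说明阈值判定,并假设生产月份已由 normalize_year_month_to_day01 规范为每月 1 号:

```python
from datetime import date

def add_months(d: date, n: int) -> date:
    """按自然月加 n 个月(假设 d 为每月 1 号,无需处理月末溢出)。"""
    total = d.year * 12 + (d.month - 1) + n
    return date(total // 12, total % 12 + 1, d.day)

def gap_months_approx(expiry: date, inspect: date) -> float:
    """近似月差:整月差 + 余天/30;正值表示距到期还有富余。"""
    months = (expiry.year - inspect.year) * 12 + (expiry.month - inspect.month)
    return months + (expiry.day - inspect.day) / 30

prod = date(2025, 6, 1)                 # 规范化后的生产月份
expiry_9m = add_months(prod, 9)         # 默认 9 个月保质期的到期日
```

按 rDate 的 9 个月分支:月差 ≥ 3 为「非大日期」,1–3 为「大日期」,0–1 为「临期」,负值为「过期」。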
def _set_product_fields(record, series, flavor, weight, prod_month_str, existing_columns):
    if COLUMN_MAPPING["产品系列"] in existing_columns:
        record[COLUMN_MAPPING["产品系列"]] = series
    if COLUMN_MAPPING["产品口味"] in existing_columns:
        record[COLUMN_MAPPING["产品口味"]] = flavor
    if COLUMN_MAPPING["产品克重"] in existing_columns:
        record[COLUMN_MAPPING["产品克重"]] = weight
    if prod_month_str and COLUMN_MAPPING["产品生产月份"] in existing_columns:  # 有月份字符串才解析
        try:
            dt = datetime.strptime(prod_month_str, "%Y-%m-%d")  # 期望 normalize_year_month_to_day01 已输出此格式
            record[COLUMN_MAPPING["产品生产月份"]] = dt.date()  # rDate 用 date 与保质期相加
        except (ValueError, TypeError):  # 解析失败
            record[COLUMN_MAPPING["产品生产月份"]] = None  # 置空避免脏数据
def rDate(row_dict):
    """计算临期状态(保持你原有的业务逻辑)"""
    prod_date = row_dict.get("产品生产月份", None)  # date 或 None
    inspect_date_str = row_dict.get("稽查日期", "").strip()  # 字符串 YYYY-MM-DD

    if not prod_date or not inspect_date_str:  # 缺任一无法算临期
        row_dict["临期状态"] = ""
        row_dict["新鲜度"] = ""
        row_dict["临期月份数"] = ""
        return

    try:
        inspect_date = datetime.strptime(inspect_date_str, "%Y-%m-%d")  # 稽查日转 datetime
    except ValueError:  # 稽查日期格式不对
        row_dict["临期状态"] = ""
        row_dict["新鲜度"] = ""
        row_dict["临期月份数"] = ""
        return

    product_series = row_dict.get("产品系列", "")  # 系列决定保质期月数
    zg_status = "未整改"  # 大日期整改状态默认值
    if product_series == "去骨凤爪":  # 6 个月保质期规则
        expiry_date = prod_date + relativedelta(months=6)  # 到期日
        gap_months = approx_gap_months_calendar(expiry_date, inspect_date)
        if gap_months >= 2:
            status, freshness, zg_status = "非大日期", "高", ""  # 整改状态清空表示无需整改
        elif 1 <= gap_months < 2:
            status, freshness = "大日期", "低"
        elif 0 <= gap_months < 1:
            status, freshness = "临期", "低"
        else:
            status, freshness = "过期", "低"
    else:  # 默认 9 个月保质期
        expiry_date = prod_date + relativedelta(months=9)
        gap_months = approx_gap_months_calendar(expiry_date, inspect_date)
        if gap_months >= 3:
            status, freshness, zg_status = "非大日期", "高", ""
        elif 1 <= gap_months < 3:
            status, freshness = "大日期", "低"
        elif 0 <= gap_months < 1:
            status, freshness = "临期", "低"
        else:
            status, freshness = "过期", "低"
row_dict["临期状态"] = status row_dict["临期状态"] = status # 写回行字典
row_dict["新鲜度"] = freshness row_dict["新鲜度"] = freshness
row_dict["临期月份数"] = round(gap_months, 2) row_dict["临期月份数"] = round(gap_months, 2) # 保留两位小数
row_dict["大日期整改状态"] = zg_status row_dict["大日期整改状态"] = zg_status
def read_team_source_from_url(
    url: str,
    *,
    timeout: float = 300,
    user_agent: str = "clean-data-api/1.0",
    dtype=str,
) -> tuple[pd.DataFrame, int]:
    """Team wide table: skip the first (title) row; return (data, index of the audit-date column)."""
    data_df, header_row = read_excel_from_url_skip1_with_header_row(
        url,
        timeout=timeout,
        user_agent=user_agent,
        dtype=dtype,
    )
    col = _find_wide_table_audit_col(header_row)
    if col is None:
        col = TEAM_WIDE_AUDIT_DATE_COL_FALLBACK
        print(
            f"⚠️ 宽表首行未识别「稽核日期/稽查日期」列,稽查日期回退用列索引 {col}"
        )
    else:
        print(f"✅ 宽表稽核/稽查日期列索引 {col},表头: {header_row.iloc[col]!r}")
    return data_df, col

def _print_source_preview(df_source_p: pd.DataFrame) -> None:
    print(f"✅ 成功读取 {len(df_source_p)} 行数据。")  # row count
    if len(df_source_p) > 0:  # only preview when there is data
        print("前 2 行数据预览(确认第 2 行是否在列):")
        print(df_source_p.head(2))  # first two rows
        print(f"列索引范围:0 到 {len(df_source_p.columns) - 1}")  # column-count hint

def _run_team_after_load(
    df_source_p: pd.DataFrame,
    target_path: str | Path,
    audit_date_str: str | None,
    yname: str = "稽查团队",
    product_groups: list | None = None,
    *,
    source_audit_col: int = TEAM_WIDE_AUDIT_DATE_COL_FALLBACK,
) -> dict:
    pg = product_groups if product_groups is not None else PRODUCT_GROUPS_JC  # product-group configuration
    result = main(  # run the core transformation
        df_source_p,
        yname,
        pg,
        target_file_path=target_path,
        audit_date_str=audit_date_str,
        source_audit_col=source_audit_col,
    )
    if result is None:  # defensive check
        return {"ok": False, "error": "main 未返回结果"}
    return {"source_rows": len(df_source_p), **result}  # attach the source row count

def run_team_conversion(
    source_url: str,
    target_path: str | Path,
    audit_date_str: str | None = None,
    *,
    yname: str = "稽查团队",
    product_groups: list | None = None,
    timeout: float = 300,
    user_agent: str = "clean-data-api/1.0",
    dtype=str,
) -> dict:
    """Download the team wide-table xlsx from source_url, transform it, and write to target_path."""
    s = (source_url or "").strip()
    low = s.lower()
    if not s or not (low.startswith("http://") or low.startswith("https://")):
        return {"ok": False, "error": "source_url 须为非空的 http(s) 地址"}
    print("正在从 URL 读取【团队】源文件(跳过第 1 行标题,第 2 行作为数据第 1 行)...")
    try:
        df_source_p, source_audit_col = read_team_source_from_url(
            s,
            timeout=timeout,
            user_agent=user_agent,
            dtype=dtype,
        )
    except urllib.error.HTTPError as e:
        return {"ok": False, "error": f"从 URL 读取源表失败: HTTP {e.code}"}
    except urllib.error.URLError as e:
        return {"ok": False, "error": f"从 URL 读取源表失败: {e.reason!s}"}
    except Exception as e:
        return {"ok": False, "error": f"读取源表失败: {e}"}
    _print_source_preview(df_source_p)
    return _run_team_after_load(
        df_source_p,
        target_path,
        audit_date_str,
        yname=yname,
        product_groups=product_groups,
        source_audit_col=source_audit_col,
    )
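The shelf-life thresholds in `rDate` above can be traced with a small stdlib-only sketch. `add_months` here is a hypothetical stand-in for `dateutil.relativedelta` (kept inline so the snippet has no third-party imports), and the default 9-month branch is assumed:

```python
# Minimal sketch of the near-expiry rule in rDate (default 9-month shelf life).
# add_months stands in for dateutil.relativedelta purely for illustration.
from datetime import date

def add_months(d: date, months: int) -> date:
    # naive month addition; day-of-month is always 1 in this pipeline, so no clamping needed
    total = d.month - 1 + months
    return d.replace(year=d.year + total // 12, month=total % 12 + 1)

def gap_months(expiry: date, inspect: date) -> float:
    # same approximation as approx_gap_months_calendar: years*12 + months + days/30
    return ((expiry.year - inspect.year) * 12
            + (expiry.month - inspect.month)
            + (expiry.day - inspect.day) / 30.0)

prod = date(2025, 1, 1)                      # normalized production month
expiry = add_months(prod, 9)                 # 2025-10-01
gap = gap_months(expiry, date(2025, 9, 10))  # 0*12 + 1 + (1-10)/30 = 0.7
status = ("非大日期" if gap >= 3 else
          "大日期" if gap >= 1 else
          "临期" if gap >= 0 else "过期")
print(round(gap, 2), status)
```

With the gap between 0 and 1 month, the row classifies as 临期 (near expiry), matching the `0 <= gap_months < 1` branch above.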
fastapi>=0.115.0
uvicorn[standard]>=0.32.0

# 数据转换_团队.py
pandas>=2.0.0
openpyxl>=3.1.0
python-dateutil>=2.8.0
/*
Navicat MySQL Data Transfer
Source Server : t100_dev
Source Server Version : 50744
Source Host : 192.168.100.39:25301
Source Database : market_bi
Target Server Type : MYSQL
Target Server Version : 50744
File Encoding : 65001
Date: 2026-03-09 18:13:42
*/
SET FOREIGN_KEY_CHECKS=0;
-- ----------------------------
-- Table structure for risk_audit_visit
-- ----------------------------
DROP TABLE IF EXISTS `risk_audit_visit`;
CREATE TABLE `risk_audit_visit` (
`rav_id` int(11) NOT NULL AUTO_INCREMENT COMMENT '主键',
`audit_date` date DEFAULT NULL COMMENT '稽查日期',
`source` varchar(20) DEFAULT NULL COMMENT '稽查来源',
`region_name` varchar(20) DEFAULT NULL COMMENT '大区',
`district_name` varchar(20) DEFAULT NULL COMMENT '战区',
`dealer_code` varchar(10) DEFAULT NULL COMMENT '经销商编码',
`dealer_name` varchar(100) DEFAULT NULL COMMENT '经销商名称',
`store_code` varchar(20) DEFAULT NULL COMMENT '门店编码',
`store_name` varchar(100) DEFAULT NULL COMMENT '勤策门店',
`f_emp_no` varchar(20) DEFAULT NULL COMMENT '客户经理工号',
`f_emp_name` varchar(100) DEFAULT NULL COMMENT '客户经理名称',
`qin_ce_type_large` varchar(20) DEFAULT NULL COMMENT '勤策渠道大类',
`jh_channel_type` varchar(20) DEFAULT NULL COMMENT '稽查渠道类型',
`city` varchar(30) DEFAULT NULL COMMENT '城市',
`channel_type` varchar(30) DEFAULT NULL COMMENT '渠道类型(稽查源提供)',
`series` varchar(20) DEFAULT NULL COMMENT '产品系列',
`taste` varchar(20) DEFAULT NULL COMMENT '产品口味',
`weight` varchar(20) DEFAULT NULL COMMENT '产品克重',
`price` decimal(10,2) DEFAULT NULL COMMENT '产品价格',
`low_price` varchar(20) DEFAULT NULL COMMENT '是否低价:低价,正常',
`low_price_diff` decimal(10,2) DEFAULT NULL COMMENT '价差',
`low_price_status` varchar(20) DEFAULT NULL COMMENT '低价整改状态',
`low_price_rectify` varchar(100) DEFAULT NULL COMMENT '低价整改说明',
`production_month` date DEFAULT NULL COMMENT '产品生产月份',
`near_month_num` int(11) DEFAULT NULL COMMENT '临期月份数',
`near_month_status` varchar(20) DEFAULT NULL COMMENT '临期状态',
`fresh_status` varchar(20) DEFAULT NULL COMMENT '新鲜度',
`large_date_status` varchar(20) DEFAULT NULL COMMENT '大日期整改状态',
`large_date_rectify` varchar(100) DEFAULT NULL COMMENT '大日期整改说明',
PRIMARY KEY (`rav_id`),
-- Business unique key: the same audit_date + source + store_name + channel_type (as provided by the audit source) + series + taste + weight = one unique record
-- ON DUPLICATE KEY UPDATE relies on this unique key to decide between INSERT and overwrite UPDATE
UNIQUE KEY `uk_biz` (`audit_date`,`source`,`store_name`(100),`channel_type`,`series`,`taste`,`weight`),
KEY `audit` (`audit_date`),
KEY `dealer` (`dealer_code`,`dealer_name`),
KEY `product_index` (`series`,`taste`,`weight`),
KEY `regiondistrict` (`region_name`,`district_name`),
KEY `type_small` (`jh_channel_type`),
KEY `weight_index` (`weight`)
) ENGINE=InnoDB AUTO_INCREMENT=493621 DEFAULT CHARSET=utf8mb4 COMMENT='稽查走访价格大日期表';
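The `uk_biz` comment above notes that `ON DUPLICATE KEY UPDATE` keys off this unique index. A hedged Python sketch of how such an upsert statement might be assembled (column list abbreviated for illustration; this is not the project's actual writer code, and executing it would require a live MySQL connection):

```python
# Hypothetical upsert against risk_audit_visit relying on uk_biz.
# On a unique-key conflict, only the listed non-key columns are refreshed.
cols = ["audit_date", "source", "store_name", "channel_type",
        "series", "taste", "weight", "price"]
update_cols = ["price"]  # non-key columns to overwrite on conflict

placeholders = ", ".join(["%s"] * len(cols))
updates = ", ".join(f"{c} = VALUES({c})" for c in update_cols)
sql = (
    f"INSERT INTO risk_audit_visit ({', '.join(cols)}) "
    f"VALUES ({placeholders}) "
    f"ON DUPLICATE KEY UPDATE {updates}"
)
print(sql)
```

The `VALUES(col)` form is valid on the MySQL 5.7 server version recorded in the dump header; newer MySQL releases prefer the row-alias syntax instead.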
"""
API 测试脚本
用于快速测试 API 的各个端点
"""
import asyncio
import httpx
import json
from datetime import datetime
BASE_URL = "http://localhost:8000"
class APITester:
"""API 测试类"""
def __init__(self, base_url: str = BASE_URL):
self.base_url = base_url
self.task_id: str = None
async def test_health_check(self):
"""测试健康检查接口"""
print("\n" + "="*50)
print("测试:健康检查接口")
print("="*50)
try:
async with httpx.AsyncClient() as client:
response = await client.get(f"{self.base_url}/api/v1/health")
print(f"状态码: {response.status_code}")
print(f"响应: {json.dumps(response.json(), indent=2, ensure_ascii=False)}")
except Exception as e:
print(f"错误: {str(e)}")
async def test_start_cleaning(self):
"""测试启动清洗任务接口"""
print("\n" + "="*50)
print("测试:启动数据清洗任务")
print("="*50)
payload = {
"excel_url": "https://example.com/test_data.xlsx",
"department": "sales",
"description": "测试数据清洗"
}
try:
async with httpx.AsyncClient() as client:
response = await client.post(
f"{self.base_url}/api/v1/clean",
json=payload
)
print(f"状态码: {response.status_code}")
data = response.json()
print(f"响应: {json.dumps(data, indent=2, ensure_ascii=False)}")
if response.status_code == 200:
self.task_id = data.get('task_id')
print(f"\n✓ 任务创建成功,Task ID: {self.task_id}")
except Exception as e:
print(f"错误: {str(e)}")
async def test_get_progress(self):
"""测试获取进度接口"""
if not self.task_id:
print("跳过:需要先创建任务")
return
print("\n" + "="*50)
print("测试:获取数据清洗进度")
print("="*50)
try:
async with httpx.AsyncClient() as client:
response = await client.get(
f"{self.base_url}/api/v1/progress/{self.task_id}"
)
print(f"状态码: {response.status_code}")
print(f"响应: {json.dumps(response.json(), indent=2, ensure_ascii=False, default=str)}")
except Exception as e:
print(f"错误: {str(e)}")
async def test_get_result(self):
"""测试获取清洗结果接口"""
if not self.task_id:
print("跳过:需要先创建任务")
return
print("\n" + "="*50)
print("测试:获取清洗结果")
print("="*50)
try:
async with httpx.AsyncClient() as client:
response = await client.get(
f"{self.base_url}/api/v1/result/{self.task_id}"
)
print(f"状态码: {response.status_code}")
data = response.json()
print(f"响应: {json.dumps(data, indent=2, ensure_ascii=False, default=str)}")
except Exception as e:
print(f"错误: {str(e)}")
async def test_save_data(self):
"""测试保存数据接口"""
if not self.task_id:
print("跳过:需要先创建任务")
return
print("\n" + "="*50)
print("测试:保存清洗后的数据")
print("="*50)
payload = {
"task_id": self.task_id,
"table_name": "sales_data"
}
try:
async with httpx.AsyncClient() as client:
response = await client.post(
f"{self.base_url}/api/v1/save",
json=payload
)
print(f"状态码: {response.status_code}")
print(f"响应: {json.dumps(response.json(), indent=2, ensure_ascii=False)}")
except Exception as e:
print(f"错误: {str(e)}")
async def run_all_tests(self):
"""运行所有测试"""
print("\n")
print("╔" + "="*48 + "╗")
print("║" + " "*10 + "数据清洗系统 API 测试" + " "*16 + "║")
print("║" + f" "*10 + f"时间: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}" + " "*15 + "║")
print("╚" + "="*48 + "╝")
await self.test_health_check()
await asyncio.sleep(1)
await self.test_start_cleaning()
await asyncio.sleep(2)
await self.test_get_progress()
await asyncio.sleep(1)
await self.test_get_result()
await asyncio.sleep(1)
print("\n" + "="*50)
print("所有测试完成!")
print("="*50 + "\n")
async def main():
"""主函数"""
tester = APITester()
await tester.run_all_tests()
if __name__ == "__main__":
print("\n提示:确保 FastAPI 服务已在 http://localhost:8000 运行中\n")
asyncio.run(main())
"""Utils 工具模块""" # 跨业务复用的小工具(日期、网络 Excel 等)
from utils.response import BizCode, ApiResponse, ok_resp, fail_resp
__all__ = ["BizCode", "ApiResponse", "ok_resp", "fail_resp"]
"""日期解析与 DataFrame 中取首个有效日期(与具体业务表头通过参数解耦)。"""
from __future__ import annotations
import re
from collections.abc import Sequence
from datetime import datetime
import pandas as pd
def _parse_yyyymmdd(s: str) -> str | None:
"""8 位 YYYYMMDD → YYYY-MM-DD;非法日历则 None。"""
if not re.fullmatch(r"\d{8}", s):
return None
try:
return datetime.strptime(s, "%Y%m%d").strftime("%Y-%m-%d")
except ValueError:
return None
def to_yyyy_mm_dd(val) -> str | None:
"""任意单元格值 → YYYY-MM-DD;无法解析则 None。"""
if val is None or (isinstance(val, float) and pd.isna(val)):
return None
if isinstance(val, str):
y = _parse_yyyymmdd(val.strip())
if y:
return y
if isinstance(val, int) and val >= 0:
s = str(val)
if len(s) == 8:
y = _parse_yyyymmdd(s)
if y:
return y
if isinstance(val, float) and val.is_integer() and val >= 0:
s = str(int(val))
if len(s) == 8:
y = _parse_yyyymmdd(s)
if y:
return y
ts = pd.to_datetime(val, errors="coerce")
if pd.isna(ts):
return None
return ts.strftime("%Y-%m-%d")
def first_yyyy_mm_dd_in_iloc(df: pd.DataFrame, col_idx: int) -> str | None:
"""自上而下取第 col_idx 列首个可解析日期(宽表无表头时常用)。"""
if df is None or df.shape[1] <= col_idx or col_idx < 0:
return None
for val in df.iloc[:, col_idx]:
n = to_yyyy_mm_dd(val)
if n:
return n
return None
def first_yyyy_mm_dd_in_dataframe(
df: pd.DataFrame,
column_names: Sequence[str],
*,
third_column_fallback: bool = True,
) -> str | None:
"""按列名顺序找第一列,自上而下取首个可解析为日期的值;若无匹配列且允许则用第 3 列(下标 2)。"""
ser = None
if df is not None and df.shape[1] > 0:
for name in column_names:
if name in df.columns:
ser = df[name]
break
if ser is None and third_column_fallback and df.shape[1] > 2:
ser = df.iloc[:, 2]
if ser is None:
return None
for val in ser:
n = to_yyyy_mm_dd(val)
if n:
return n
return None
def normalize_year_month_to_day01(src_month):
"""
生产月份类字符串 → YYYY-MM-01(供后续 strptime %Y-%m-%d)。
支持 yyyy-mm、yyyymm;其它类型/格式原样返回。
"""
if not isinstance(src_month, str):
return src_month
src_month = src_month.strip()
if not src_month:
return src_month
if re.fullmatch(r"\d{4}-\d{1,2}", src_month):
year, month = src_month.split("-")
return f"{year}-{month.zfill(2)}-01"
if re.fullmatch(r"\d{6}", src_month):
year = src_month[:4]
month = src_month[4:].zfill(2)
return f"{year}-{month}-01"
return src_month
def approx_gap_months_calendar(expiry_date, inspect_date) -> float:
"""到期日相对检查日的剩余月数近似值(与原业务公式一致:年*12+月+日/30)。"""
diff_years = expiry_date.year - inspect_date.year
diff_months = expiry_date.month - inspect_date.month
diff_days = expiry_date.day - inspect_date.day
return diff_years * 12 + diff_months + diff_days / 30.0
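The month normalization above can be exercised standalone. This is a minimal inline copy of `normalize_year_month_to_day01` for illustration only; the real implementation lives in the module above:

```python
# Inline copy of normalize_year_month_to_day01 so the demo runs standalone.
import re
from datetime import datetime

def normalize_year_month_to_day01(src_month):
    if not isinstance(src_month, str):
        return src_month
    src_month = src_month.strip()
    if not src_month:
        return src_month
    if re.fullmatch(r"\d{4}-\d{1,2}", src_month):       # yyyy-mm, month may be 1 digit
        year, month = src_month.split("-")
        return f"{year}-{month.zfill(2)}-01"
    if re.fullmatch(r"\d{6}", src_month):               # yyyymm
        return f"{src_month[:4]}-{src_month[4:].zfill(2)}-01"
    return src_month                                    # anything else passes through

print(normalize_year_month_to_day01("2025-1"))   # 2025-01-01
print(normalize_year_month_to_day01("202512"))   # 2025-12-01
print(normalize_year_month_to_day01("bad"))      # unchanged
# the result feeds straight into the strptime("%Y-%m-%d") call in _set_product_fields
datetime.strptime(normalize_year_month_to_day01("202512"), "%Y-%m-%d")
```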
"""从 URL 下载 Excel 到内存并用 pandas 解析(不写本地临时文件)。"""
from __future__ import annotations
import io
import urllib.request
import pandas as pd
def read_excel_from_url(
url: str,
*,
timeout: float = 300,
user_agent: str = "clean-data-api/1.0",
skiprows: int = 0,
header=None,
dtype=str,
) -> pd.DataFrame:
req = urllib.request.Request(url.strip(), headers={"User-Agent": user_agent})
with urllib.request.urlopen(req, timeout=timeout) as resp:
raw = resp.read()
return pd.read_excel(io.BytesIO(raw), skiprows=skiprows, header=header, dtype=dtype)
def read_excel_from_url_skip1_with_header_row(
url: str,
*,
timeout: float = 300,
user_agent: str = "clean-data-api/1.0",
dtype=str,
) -> tuple[pd.DataFrame, pd.Series]:
"""跳过第 1 行后的数据 + 被跳过的第 1 行(表头,0 基列与数据列对齐)。"""
req = urllib.request.Request(url.strip(), headers={"User-Agent": user_agent})
with urllib.request.urlopen(req, timeout=timeout) as resp:
raw = resp.read()
buf = io.BytesIO(raw)
header_df = pd.read_excel(buf, header=None, dtype=dtype, nrows=1)
buf.seek(0)
data_df = pd.read_excel(buf, skiprows=1, header=None, dtype=dtype)
return data_df, header_df.iloc[0]
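The two-pass read above can be demonstrated without a network round trip by fabricating a workbook in memory, assuming openpyxl and pandas (both in the requirements) are installed:

```python
# In-memory demo of the two-pass read (header row + data with skiprows=1);
# openpyxl fabricates the workbook instead of downloading it from a URL.
import io
import pandas as pd
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.append(["勤策门店", "稽核日期"])   # row 1: the title/header row that gets skipped
ws.append(["门店A", "20250101"])      # row 2: first data row
buf = io.BytesIO()
wb.save(buf)
buf.seek(0)

header_df = pd.read_excel(buf, header=None, dtype=str, nrows=1)   # pass 1: header only
buf.seek(0)                                                       # rewind before pass 2
data_df = pd.read_excel(buf, skiprows=1, header=None, dtype=str)  # pass 2: data only

print(header_df.iloc[0].tolist())  # ['勤策门店', '稽核日期']
print(data_df.shape)               # (1, 2)
```

The `buf.seek(0)` between the two reads is what lets the same in-memory bytes be parsed twice.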
"""
异常定义模块
"""
class DataCleaningException(Exception):
"""数据清洗异常"""
pass
class DatabaseException(Exception):
"""数据库异常"""
pass
class ExcelParsingException(Exception):
"""Excel 解析异常"""
pass
class ValidationException(Exception):
"""验证异常"""
pass
"""
统一响应格式封装模块
所有接口统一返回: { code: 业务状态码, msg: 消息, data: 数据 }
"""
from enum import IntEnum
from typing import Any
from fastapi.responses import JSONResponse
from pydantic import BaseModel
class BizCode(IntEnum):
"""业务逻辑状态码"""
SUCCESS = 200 # 通用成功
TASK_QUEUED = 201 # 任务已入队(异步场景)
TASK_PROCESSING = 202 # 任务处理中
BAD_REQUEST = 400 # 请求参数错误
NOT_FOUND = 404 # 资源不存在
TASK_FAILED = 422 # 任务执行失败(业务层)
SERVER_ERROR = 500 # 服务器内部错误
DB_ERROR = 501 # 数据库错误
EXCEL_ERROR = 502 # Excel 解析错误
class ApiResponse(BaseModel):
"""统一 API 响应体"""
code: int
msg: str
data: Any = None
def ok_resp(data: Any = None, msg: str = "success") -> JSONResponse:
"""返回成功的 JSONResponse(HTTP 200)"""
return JSONResponse(
status_code=200,
content=ApiResponse(code=BizCode.SUCCESS, msg=msg, data=data).model_dump()
)
def fail_resp(
biz_code: BizCode,
msg: str,
http_status: int = 400,
data: Any = None
) -> JSONResponse:
"""返回失败的 JSONResponse"""
return JSONResponse(
status_code=http_status,
content=ApiResponse(code=biz_code, msg=msg, data=data).model_dump()
)
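The `{ code, msg, data }` envelope can be shown without FastAPI. A stdlib-only sketch of what `ok_resp` / `fail_resp` serialize (the real helpers wrap this dict in a `JSONResponse`):

```python
# Stdlib-only sketch of the unified response envelope; FastAPI is not imported,
# so `envelope` is a hypothetical stand-in for the real helpers above.
from enum import IntEnum

class BizCode(IntEnum):
    SUCCESS = 200
    NOT_FOUND = 404

def envelope(code: BizCode, msg: str, data=None) -> dict:
    # same three-field shape every endpoint returns
    return {"code": int(code), "msg": msg, "data": data}

ok = envelope(BizCode.SUCCESS, "success", {"task_id": "t-1"})
err = envelope(BizCode.NOT_FOUND, "task not found")
print(ok)   # {'code': 200, 'msg': 'success', 'data': {'task_id': 't-1'}}
print(err)  # {'code': 404, 'msg': 'task not found', 'data': None}
```

Note the business `code` is carried in the body and is independent of the HTTP status code the helpers also set.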
"""
数据验证模块
"""
import re
import logging
logger = logging.getLogger(__name__)
def validate_excel_url(url: str) -> bool:
"""
验证 Excel URL 的有效性
Args:
url: URL 字符串
Returns:
bool: 是否为有效的 Excel URL
"""
if not url or not isinstance(url, str):
return False
# 检查 URL 格式
url_pattern = r'^https?://.*\.(xlsx|xls|csv)$'
if not re.match(url_pattern, url, re.IGNORECASE):
logger.warning(f"URL 格式无效: {url}")
return False
return True
def sanitize_filename(filename: str) -> str:
"""
清理文件名,移除不安全的字符
Args:
filename: 原始文件名
Returns:
str: 清理后的文件名
"""
# 移除不安全字符
sanitized = re.sub(r'[<>:"/\\|?*]', '', filename)
return sanitized[:255] # 限制长度
def validate_table_name(table_name: str) -> bool:
"""
验证数据库表名的有效性
Args:
table_name: 表名
Returns:
bool: 是否为有效的表名
"""
if not table_name or not isinstance(table_name, str):
return False
# MySQL 表名规则:以字母、数字或下划线开头,不包含特殊字符
table_name_pattern = r'^[a-zA-Z_][a-zA-Z0-9_]{0,63}$'
if not re.match(table_name_pattern, table_name):
logger.warning(f"表名格式无效: {table_name}")
return False
return True
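A quick standalone check of the two validation regexes above. Note that the URL pattern requires the path to end with the file extension, so a signed URL with a query string is rejected:

```python
# Exercise the validation regexes, inlined so the snippet runs standalone.
import re

url_pattern = r'^https?://.*\.(xlsx|xls|csv)$'
table_name_pattern = r'^[a-zA-Z_][a-zA-Z0-9_]{0,63}$'

assert re.match(url_pattern, "https://example.com/data.xlsx", re.IGNORECASE)
# query string defeats the suffix check — worth knowing for pre-signed download links
assert not re.match(url_pattern, "https://example.com/data.xlsx?sig=abc", re.IGNORECASE)

assert re.match(table_name_pattern, "risk_audit_visit")
assert not re.match(table_name_pattern, "1table")       # must not start with a digit
assert not re.match(table_name_pattern, "drop table;")  # spaces/semicolons rejected
```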