Commit fa6c4c14, authored by lidongxu — "新建项目" (New project)
# ========== Environment & secrets ==========
.env
.env.local
.env.*.local
*.env
# ========== Python ==========
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
# Virtual environments
venv/
.venv/
env/
ENV/
# ========== Tests & coverage ==========
.pytest_cache/
.coverage
htmlcov/
.tox/
.nox/
coverage.xml
*.cover
.hypothesis/
# ========== IDE / editors ==========
.idea/
.vscode/
*.swp
*.swo
*~
.project
.pydevproject
.settings/
# ========== OS files ==========
.DS_Store
.DS_Store?
Thumbs.db
ehthumbs.db
Desktop.ini
# ========== Logs & temp files ==========
*.log
*.tmp
*.temp
.cache/
# ========== Misc ==========
*.sql.backup
*.bak
# Data Cleaning System - Project Documentation

## Project Overview

This project is a data cleaning system built with the FastAPI framework. It extracts data from Excel files, cleans it, and saves the final result to a MySQL database.

### Core Features

1. **Excel parsing**: download and parse Excel files from a URL
2. **Data cleaning**: validate, clean, and deduplicate the parsed data
3. **Progress feedback**: report cleaning progress to the frontend in near real time via HTTP polling
4. **Persistence**: save the cleaned data to a MySQL database

---

## Project Structure
```
clean_data/
├── index.py                # application entry point
├── requirements.txt        # dependency list
├── .env.example            # example environment configuration
├── README.md               # this document
├── core/                   # core business modules
│   ├── __init__.py
│   ├── excel_handler.py    # Excel file handling
│   ├── data_cleaner.py     # data cleaning logic
│   ├── db_handler.py       # database access
│   └── progress_manager.py # progress tracking
└── utils/                  # utility modules
    ├── __init__.py
    ├── exceptions.py       # custom exceptions
    └── validators.py       # data validation
```
---
## Quick Start

### 1. Environment Setup

```bash
# Clone the project (if needed), then enter the directory
cd clean_data

# Create a virtual environment (recommended)
python -m venv venv

# Activate it
# Windows:
venv\Scripts\activate
# Linux/Mac:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```

### 2. Configure Environment Variables

```bash
# Copy the example configuration
cp .env.example .env

# Edit .env and fill in your actual settings. In particular:
# - DB_HOST, DB_PORT, DB_USER, DB_PASSWORD must point at your database
# - DB_NAME is the database to use
```

### 3. Start the Service

```bash
# Option 1: run directly with Python
python index.py

# Option 2: run with Uvicorn (recommended)
uvicorn index:app --host 0.0.0.0 --port 8000 --reload

# The service listens on http://0.0.0.0:8000
# API docs: http://localhost:8000/docs (Swagger UI)
```
---
## API Reference

### 1. Start a Cleaning Task

**Request**
```
POST /api/v1/clean
```
**Request body**
```json
{
    "excel_url": "https://example.com/data.xlsx",
    "department": "sales",
    "description": "Q1销售数据清洗"
}
```
**Response**
```json
{
    "task_id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "queued",
    "message": "任务已创建,正在处理中...",
    "data_preview": null
}
```

### 2. Get Cleaning Progress

**Request**
```
GET /api/v1/progress/{task_id}
```
**Response**
```json
{
    "task_id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "processing",
    "progress": 65,
    "message": "已清洗 650/1000 行数据",
    "timestamp": "2026-03-06T10:30:45.123456"
}
```
**Status values**
- `queued`: task created, waiting in the queue
- `processing`: data is being processed
- `completed`: cleaning finished
- `failed`: an error occurred during cleaning

### 3. Get the Cleaning Result

**Request**
```
GET /api/v1/result/{task_id}
```
**Response**
```json
{
    "task_id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "ready_to_save",
    "message": "数据清洗完成,可进行保存",
    "data_preview": [
        {"产品": "产品A", "金额": 1000},
        {"产品": "产品B", "金额": 2000}
    ],
    "total_rows": 1000,
    "department": "sales"
}
```

### 4. Save the Cleaned Data

**Request**
```
POST /api/v1/save
```
**Request body**
```json
{
    "task_id": "550e8400-e29b-41d4-a716-446655440000",
    "table_name": "sales_data"
}
```
**Response**
```json
{
    "task_id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "saved",
    "message": "数据已成功保存到数据库",
    "affected_rows": 1000
}
```

### 5. Health Check

**Request**
```
GET /api/v1/health
```
**Response**
```json
{
    "status": "healthy",
    "timestamp": "2026-03-06T10:30:45.123456",
    "service": "数据清洗系统"
}
```
---
## Progress Feedback Mechanism

### HTTP Polling (no WebSocket required)

The system reports progress via **HTTP polling**, which has several advantages:

1. **No persistent connections**: the client initiates each request, reducing server load
2. **Broad compatibility**: works with any HTTP client
3. **Simple deployment**: no WebSocket infrastructure required
4. **Easy to scale**: straightforward to deploy in any cloud environment

### Suggested Frontend Implementation

```javascript
// Example: React/Vue polling logic
const pollProgress = async (taskId) => {
  const interval = setInterval(async () => {
    try {
      const response = await fetch(`/api/v1/progress/${taskId}`);
      const data = await response.json();
      // Update the progress bar
      updateProgressBar(data.progress);
      updateMessage(data.message);
      // Stop polling once the task finishes
      if (data.status === 'completed' || data.status === 'failed') {
        clearInterval(interval);
      }
    } catch (error) {
      console.error('Failed to fetch progress:', error);
    }
  }, 1000); // poll once per second
};
```
---
## Data Cleaning Logic

### Cleaning Steps

1. **Download**: fetch the Excel file from the given URL
2. **Parse**: read the Excel content with openpyxl
3. **Validate**: check data types and required fields
4. **Clean**
   - strip leading/trailing whitespace
   - handle empty values
   - remove duplicates
5. **Cache**: keep the cleaned data in memory
6. **Save**: persist to the database after frontend confirmation
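Steps 3 and 4 above can be sketched as a small, self-contained function. The names here are illustrative only; the real logic lives in `core/data_cleaner.py`:

```python
def clean_rows(rows):
    """Sketch of steps 3-4: trim strings, normalize empties, drop blank rows, dedupe."""
    cleaned, seen = [], set()
    for row in rows:
        # Strip whitespace; empty strings become None
        trimmed = {k: (v.strip() or None) if isinstance(v, str) else v
                   for k, v in row.items()}
        # Skip rows that are entirely empty
        if not any(v is not None for v in trimmed.values()):
            continue
        # Deduplicate on the full row content
        key = tuple(sorted(trimmed.items()))
        if key not in seen:
            seen.add(key)
            cleaned.append(trimmed)
    return cleaned

rows = [
    {"产品": " 产品A ", "金额": 1000},  # whitespace to trim
    {"产品": "产品A", "金额": 1000},    # duplicate after trimming
    {"产品": "", "金额": None},         # blank row, dropped
]
result = clean_rows(rows)
```

Deduplication here is on the full row content; the real implementation may key on a subset of columns instead.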
### Custom Cleaning Rules

Edit the `_validate_required_fields` method in `core/data_cleaner.py` to customize per-department cleaning rules:

```python
required_fields_map = {
    'sales': ['产品', '金额', '销售日期'],
    'inventory': ['SKU', '数量', '仓库'],
    'finance': ['交易日期', '金额', '类别']
}
```
---
## Database Configuration

### MySQL 5.6+ Connection Settings

Edit the `.env` file:

```ini
DB_HOST=localhost
DB_PORT=3306
DB_USER=root
DB_PASSWORD=your_password
DB_NAME=clean_data
```

### Create the Target Table (example)

```sql
CREATE TABLE sales_data (
    id INT AUTO_INCREMENT PRIMARY KEY,
    产品 VARCHAR(100),
    金额 DECIMAL(10, 2),
    销售日期 DATE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```
---
## Exception Handling

The system defines several custom exceptions to aid error tracing:

- **DataCleaningException**: errors during data cleaning
- **DatabaseException**: database operation errors
- **ExcelParsingException**: Excel parsing errors
- **ValidationException**: data validation errors

All exceptions are logged, making problems easier to diagnose.
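A minimal version of `utils/exceptions.py` consistent with that list might look like the sketch below; the exact class bodies, any inheritance hierarchy, and the `parse_or_fail` helper are assumptions for illustration:

```python
class DataCleaningException(Exception):
    """Error raised during the cleaning pipeline."""

class DatabaseException(Exception):
    """Error raised by database operations."""

class ExcelParsingException(Exception):
    """Error raised while downloading or parsing an Excel file."""

class ValidationException(Exception):
    """Error raised when a row fails validation."""

# Typical use: wrap low-level failures so the API layer can log them
# and map them to an HTTP error response.
def parse_or_fail(payload: bytes) -> bytes:
    if not payload:
        raise ExcelParsingException("downloaded file is empty")
    return payload
```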
---
## Logging

The system uses Python's standard logging module for all operations. The log level and file are configurable in `.env`:

```
LOG_LEVEL=INFO
LOG_FILE=./logs/app.log
```

Logged events include:

- task creation and completion
- data processing progress
- exceptions and errors
- database operations
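A sketch of how these two settings could be applied at startup; the project may wire this differently, and `setup_logging` is a hypothetical helper:

```python
import logging
import os

def setup_logging():
    """Configure root logging from LOG_LEVEL / LOG_FILE, as described above."""
    level_name = os.getenv("LOG_LEVEL", "INFO").upper()
    log_file = os.getenv("LOG_FILE")  # unset -> console only
    handlers = [logging.StreamHandler()]
    if log_file:
        os.makedirs(os.path.dirname(log_file) or ".", exist_ok=True)
        handlers.append(logging.FileHandler(log_file, encoding="utf-8"))
    logging.basicConfig(
        level=getattr(logging, level_name, logging.INFO),
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
        handlers=handlers,
        force=True,  # replace any earlier configuration
    )

setup_logging()
logging.getLogger("clean_data").info("logging configured")
```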
---
## Performance Tips

1. **Batch inserts**: database writes use batched inserts (1000 rows per batch by default)
2. **Async processing**: FastAPI background tasks keep responses non-blocking
3. **Progress caching**: an in-memory dict caches progress data and cleaning results
4. **Connection pooling**: consider a database connection pool (possible extension)
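Tip 1 can be illustrated with a runnable sketch. sqlite3 stands in for MySQL here so the example runs anywhere; mysql-connector-python uses `%s` placeholders instead of `?`, but the `executemany` batching pattern is the same:

```python
import sqlite3

def insert_in_batches(conn, table, rows, batch_size=1000):
    """Insert dict rows with executemany in fixed-size batches."""
    if not rows:
        return 0
    cols = list(rows[0].keys())
    sql = (f"INSERT INTO {table} ({', '.join(cols)}) "
           f"VALUES ({', '.join('?' * len(cols))})")
    cur = conn.cursor()
    total = 0
    for start in range(0, len(rows), batch_size):
        batch = [tuple(r.get(c) for c in cols) for r in rows[start:start + batch_size]]
        cur.executemany(sql, batch)
        total += cur.rowcount  # accumulate per batch; rowcount only covers the last call
    conn.commit()
    return total

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a TEXT, b INTEGER)")
n = insert_in_batches(conn, "t", [{"a": "x", "b": i} for i in range(2500)], batch_size=1000)
```

Accumulating `rowcount` per batch matters: after a loop of `executemany` calls, the cursor's `rowcount` reflects only the most recent call.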
---
## FAQ

### Q: Why not WebSocket?

A: HTTP polling has these advantages:
- the server keeps no connection state
- easier horizontal scaling
- no WebSocket libraries or infrastructure needed
- standard HTTP, so compatibility is broad

### Q: Where is the cleaned data stored?

A: By default:
- **Short term**: in server memory (keyed by task_id)
- **Long term**: in MySQL, after the user confirms the save

### Q: How are large files handled?

A: Set a maximum file size in `.env`:

```
MAX_EXCEL_SIZE=52428800  # 50MB
```
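Enforcing that limit before parsing could look like the sketch below; `MAX_EXCEL_SIZE` is read the same way the project's config does, and `check_size` is a hypothetical helper:

```python
import os

MAX_EXCEL_SIZE = int(os.getenv("MAX_EXCEL_SIZE", "52428800"))  # 50 MB default

def check_size(content: bytes, limit: int = MAX_EXCEL_SIZE) -> bytes:
    """Reject a downloaded file that exceeds the configured size limit."""
    if len(content) > limit:
        raise ValueError(f"file is {len(content)} bytes, limit is {limit}")
    return content

small = check_size(b"ok", limit=10)  # passes
```

Checking the response's Content-Length header before downloading would avoid buffering oversized files at all, at the cost of trusting the server.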
---
## Optional Extensions

1. **Data backup**: periodically back up saved data
2. **Audit log**: record every data modification
3. **Access control**: add user authentication and authorization
4. **Cache upgrade**: replace the in-memory cache with Redis
5. **Task queue**: use Celery for large batch jobs

---

## Deployment Notes

### Production

1. Run the app with Gunicorn + Uvicorn
2. Put it behind a reverse proxy (nginx)
3. Enable HTTPS
4. Persist logs
5. Set up monitoring and alerting

### Docker

```dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["uvicorn", "index:app", "--host", "0.0.0.0", "--port", "8000"]
```
---
## Tech Stack

- **Web framework**: FastAPI 0.104.1
- **ASGI server**: Uvicorn 0.24.0
- **Excel processing**: openpyxl 3.1.2
- **Database driver**: mysql-connector-python 8.2.0
- **Validation**: Pydantic 2.5.0
- **HTTP client**: requests 2.31.0

---

## License

MIT

---

## Support

For questions or suggestions, please contact the development team.
"""
配置管理模块
负责读取和管理应用配置
"""
import os
from typing import Optional
from dotenv import load_dotenv
# 加载 .env 文件
load_dotenv()
class Config:
"""应用配置类"""
# 服务器配置
HOST: str = os.getenv("HOST", "0.0.0.0")
PORT: int = int(os.getenv("PORT", "8000"))
DEBUG: bool = os.getenv("DEBUG", "False").lower() == "true"
# 数据库配置
DB_HOST: str = os.getenv("DB_HOST", "localhost")
DB_PORT: int = int(os.getenv("DB_PORT", "3306"))
DB_USER: str = os.getenv("DB_USER", "root")
DB_PASSWORD: str = os.getenv("DB_PASSWORD", "")
DB_NAME: str = os.getenv("DB_NAME", "clean_data")
# 日志配置
LOG_LEVEL: str = os.getenv("LOG_LEVEL", "INFO")
LOG_FILE: Optional[str] = os.getenv("LOG_FILE")
# Excel 下载配置
EXCEL_DOWNLOAD_TIMEOUT: int = int(os.getenv("EXCEL_DOWNLOAD_TIMEOUT", "30"))
MAX_EXCEL_SIZE: int = int(os.getenv("MAX_EXCEL_SIZE", "52428800")) # 50MB
# 任务超时配置
TASK_TIMEOUT_SECONDS: int = int(os.getenv("TASK_TIMEOUT_SECONDS", "3600")) # 1小时
@classmethod
def get_db_config(cls) -> dict:
"""获取数据库配置字典"""
return {
'host': cls.DB_HOST,
'port': cls.DB_PORT,
'user': cls.DB_USER,
'password': cls.DB_PASSWORD,
'database': cls.DB_NAME,
}
# 创建全局配置实例
config = Config()
"""Core 业务模块"""
"""
数据清洗模块
负责数据的清洗和验证逻辑
"""
import logging
import asyncio
import pandas as pd
from typing import List, Dict, Any, Callable, Optional
logger = logging.getLogger(__name__)
# 各 department 对应的清洗策略注册表
# key: department 名称, value: (transform函数, 产品组配置, 稽查来源名称)
_DEPARTMENT_CLEANERS = {}
def _load_department_cleaners():
"""非专用清洗逻辑"""
global _DEPARTMENT_CLEANERS
if _DEPARTMENT_CLEANERS: # 如果部门清洗模块已加载,则直接返回
return
try:
# 加载部门清洗使用的工具
from core_py.数据转换_团队 import (
transform as _team_transform,
PRODUCT_GROUPS_JC,
) # PRODUCT_GROUPS_JC 风控稽查数据清洗配置数据
_DEPARTMENT_CLEANERS["风控稽查数据清洗"] = (_team_transform, PRODUCT_GROUPS_JC, "稽查团队")
logger.info("已加载部门清洗模块: 风控稽查数据清洗")
except ImportError as e:
logger.warning(f"加载团队清洗模块失败: {e}")
class DataCleaner:
"""数据清洗类"""
def __init__(self):
self.rules = {}
async def clean(
self,
raw_data: List[Dict[str, Any]],
department: str,
progress_callback: Optional[Callable[[float, str, Optional[int]], None]] = None,
audit_date: Optional[str] = None,
) -> List[Dict[str, Any]]:
"""
清洗数据
Args:
raw_data: 原始数据列表(每行为 dict,key 为列名)
department: 业务部门名称,如 "团队"
progress_callback: 进度回调函数,接收 (progress: 0-1, message: str)
audit_date: 稽查日期字符串,格式 'yyyy-mm-dd';为 None 时由各清洗模块自动取上月1号
Returns:
List[Dict]: 清洗后的数据
"""
try:
logger.info(f"开始清洗数据,部门: {department},数据行数: {len(raw_data)}")
# ── 专项清洗路由 ──────────────────────────────────────────────
_load_department_cleaners()
if department in _DEPARTMENT_CLEANERS:
return await self._clean_by_department(
raw_data, department, progress_callback, audit_date=audit_date
)
# ─────────────────────────────────────────────────────────────
total_rows = len(raw_data)
cleaned_data = []
for idx, row in enumerate(raw_data):
try:
cleaned_row = await self._validate_and_convert(row, department)
if cleaned_row and not self._is_duplicate(
cleaned_row, cleaned_data
):
cleaned_data.append(cleaned_row)
if progress_callback and idx % max(1, total_rows // 10) == 0:
progress = idx / total_rows if total_rows > 0 else 0
progress_callback(progress, f"已清洗 {idx}/{total_rows} 行数据", len(cleaned_data))
except Exception as e:
logger.warning(f"第 {idx + 1} 行数据清洗失败: {str(e)}")
continue
if progress_callback:
progress_callback(1.0, f"清洗完成,共 {len(cleaned_data)} 行有效数据", len(cleaned_data))
logger.info(
f"数据清洗完成,原始行数: {total_rows},清洗后行数: {len(cleaned_data)}"
)
return cleaned_data
except Exception as e:
logger.error(f"clean 方法执行失败: {str(e)}")
raise
async def _clean_by_department(
self,
raw_data: List[Dict[str, Any]],
department: str,
progress_callback: Optional[Callable[[float, str, Optional[int]], None]] = None,
audit_date: Optional[str] = None,
) -> List[Dict[str, Any]]:
"""
调用对应部门的专项 transform 函数进行清洗。
raw_data 来自 excel_handler(List[Dict],key 为列名),
transform 函数通过 iloc 按位置访问列,因此转换为 DataFrame 时
只要列顺序与原始 Excel 一致,iloc 索引就能正确对应。
"""
transform_fn, pg, yname = _DEPARTMENT_CLEANERS[department]
if progress_callback:
progress_callback(0.1, "正在转换数据格式", None)
# List[Dict] → DataFrame(保留原始列顺序,iloc 索引与 Excel 列位置对应)
df = pd.DataFrame(raw_data)
if progress_callback:
progress_callback(0.3, f"正在执行 {department} 数据清洗", None)
# transform 是同步函数,用 asyncio.to_thread 避免阻塞事件循环
records = await asyncio.to_thread(transform_fn, df, yname, pg, audit_date)
if progress_callback:
progress_callback(1.0, f"清洗完成,共 {len(records)} 行有效数据", len(records))
logger.info(f"[{department}] 专项清洗完成,共 {len(records)} 条记录")
return records
async def _validate_and_convert(
self, row: Dict[str, Any], department: str
) -> Optional[Dict[str, Any]]:
"""
验证和转换单行数据
Args:
row: 数据行
department: 业务部门名称
Returns:
转换后的数据行,若无效则返回 None
"""
try:
cleaned_row = {}
for key, value in row.items():
if value is None or (isinstance(value, str) and not value.strip()):
# 空值处理
cleaned_row[key] = None
continue
# 字符串数据清洗
if isinstance(value, str):
cleaned_row[key] = value.strip()
else:
cleaned_row[key] = value
# 验证必填字段(根据部门调整规则)
if not self._validate_required_fields(cleaned_row, department):
return None
return cleaned_row
except Exception as e:
logger.warning(f"_validate_and_convert 失败: {str(e)}")
return None
def _validate_required_fields(self, row: Dict[str, Any], department: str) -> bool:
"""
验证必填字段
Args:
row: 数据行
department: 业务部门
Returns:
bool: 是否通过验证
"""
# 示例:可根据部门定义不同的必填字段规则
required_fields_map = {
"sales": ["产品", "金额"],
"inventory": ["SKU", "数量"],
"finance": ["交易日期", "金额"],
}
required_fields = required_fields_map.get(department, [])
# 检查必填字段是否存在且非空
for field in required_fields:
if field not in row or row[field] is None:
return False
return True
def _is_duplicate(
self, row: Dict[str, Any], existing_data: List[Dict[str, Any]]
) -> bool:
"""
检查行是否为重复数据
Args:
row: 当前行
existing_data: 已有数据列表
Returns:
bool: 是否为重复
"""
# 简单的重复检查(可扩展为更复杂的逻辑)
for existing_row in existing_data:
if row == existing_row:
return True
return False
"""
数据库处理模块
负责与 MySQL 数据库的交互
"""
import logging
import mysql.connector
from typing import List, Dict, Any
import os
from contextlib import contextmanager
logger = logging.getLogger(__name__)
class DatabaseHandler:
"""数据库处理类"""
def __init__(self):
"""初始化数据库配置"""
self.db_config = {
'host': os.getenv('DB_HOST', 'localhost'),
'user': os.getenv('DB_USER', 'root'),
'password': os.getenv('DB_PASSWORD', ''),
'database': os.getenv('DB_NAME', 'clean_data'),
'port': int(os.getenv('DB_PORT', 3306)),
'autocommit': False,
'connection_timeout': 10
}
@contextmanager
def _get_connection(self):
"""
获取数据库连接的上下文管理器
Yields:
mysql.connector.MySQLConnection: 数据库连接
Raises:
Exception: 连接失败时抛出异常
"""
connection = None
try:
connection = mysql.connector.connect(**self.db_config)
logger.info("数据库连接成功")
yield connection
except mysql.connector.Error as e:
logger.error(f"数据库连接失败: {str(e)}")
raise
finally:
if connection and connection.is_connected():
connection.close()
logger.info("数据库连接已关闭")
async def insert_data(
self,
table_name: str,
data: List[Dict[str, Any]]
) -> int:
"""
将数据插入到指定的表
Args:
table_name: 目标表名
data: 数据列表
Returns:
int: 受影响的行数
Raises:
Exception: 插入失败时抛出异常
"""
if not data:
logger.warning("插入的数据为空")
return 0
try:
with self._get_connection() as connection:
cursor = connection.cursor()
# 获取字段名
columns = list(data[0].keys())
column_names = ', '.join([f'`{col}`' for col in columns])
placeholders = ', '.join(['%s'] * len(columns))
insert_sql = f"""
INSERT INTO `{table_name}` ({column_names})
VALUES ({placeholders})
"""
logger.info(f"准备插入 {len(data)} 行数据到表 {table_name}")
# 批量插入数据
for batch_start in range(0, len(data), 1000):
batch_end = min(batch_start + 1000, len(data))
batch_data = data[batch_start:batch_end]
# 准备批次数据
values_list = []
for row in batch_data:
values = tuple(row.get(col) for col in columns)
values_list.append(values)
# 执行批量插入
cursor.executemany(insert_sql, values_list)
logger.info(f"已插入 {batch_end} / {len(data)} 行数据")
connection.commit()
affected_rows = cursor.rowcount
cursor.close()
logger.info(f"成功插入 {affected_rows} 行数据到 {table_name}")
return affected_rows
except mysql.connector.Error as e:
logger.error(f"MySQL 错误: {str(e)}")
raise
except Exception as e:
logger.error(f"insert_data 失败: {str(e)}")
raise
async def test_connection(self) -> bool:
"""
测试数据库连接
Returns:
bool: 连接是否成功
"""
try:
with self._get_connection() as connection:
cursor = connection.cursor()
cursor.execute("SELECT 1")
cursor.fetchone()
cursor.close()
return True
except Exception as e:
logger.error(f"数据库连接测试失败: {str(e)}")
return False
async def create_table_if_not_exists(
self,
table_name: str,
schema: Dict[str, str]
) -> bool:
"""
如果表不存在则创建表
Args:
table_name: 表名
schema: 表架构定义 {列名: 列定义}
Returns:
bool: 是否创建成功或表已存在
"""
try:
with self._get_connection() as connection:
cursor = connection.cursor()
# 检查表是否存在
cursor.execute(f"""
SELECT TABLE_NAME FROM information_schema.TABLES
WHERE TABLE_SCHEMA = '{self.db_config['database']}'
AND TABLE_NAME = '{table_name}'
""")
if cursor.fetchone():
logger.info(f"表 {table_name} 已存在")
cursor.close()
return True
# 创建表
columns_sql = ', '.join([f'`{col}` {definition}' for col, definition in schema.items()])
create_sql = f"""
CREATE TABLE `{table_name}` (
id INT AUTO_INCREMENT PRIMARY KEY,
{columns_sql},
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
"""
cursor.execute(create_sql)
connection.commit()
cursor.close()
logger.info(f"成功创建表 {table_name}")
return True
except Exception as e:
logger.error(f"create_table_if_not_exists 失败: {str(e)}")
raise
"""
Excel 文件处理模块
负责从 URL 下载和解析 Excel 文件
"""
import aiohttp
import logging
from openpyxl import load_workbook
from io import BytesIO
from typing import List, Dict, Any
import os
import tempfile
logger = logging.getLogger(__name__)
class ExcelHandler:
"""Excel 文件处理类"""
def __init__(self):
self.timeout = aiohttp.ClientTimeout(total=30)
async def fetch_bytes(self, url: str) -> bytes:
"""
从 URL 下载文件,返回原始字节内容(供调用方自行用 pandas 解析)
Args:
url: 文件的网络链接
Returns:
bytes: 文件的原始二进制内容
"""
try:
logger.info(f"开始从 {url} 下载文件")
async with aiohttp.ClientSession(timeout=self.timeout) as session:
async with session.get(url) as response:
if response.status != 200:
raise Exception(f"下载失败,HTTP 状态码: {response.status}")
content = await response.read()
logger.info(f"下载完成,文件大小: {len(content)} 字节")
return content
except Exception as e:
logger.error(f"fetch_bytes 失败: {str(e)}")
raise
async def fetch_and_parse(self, excel_url: str) -> List[Dict[str, Any]]:
"""
从 URL 下载并解析 Excel 文件
Args:
excel_url: Excel 文件的网络链接
Returns:
List[Dict]: 解析后的数据,每行为一个字典
Raises:
Exception: 下载或解析失败时抛出异常
"""
try:
# 1. 下载文件
logger.info(f"开始从 {excel_url} 下载 Excel 文件")
async with aiohttp.ClientSession(timeout=self.timeout) as session:
async with session.get(excel_url) as response:
if response.status != 200:
raise Exception(f"下载失败,HTTP 状态码: {response.status}")
excel_content = await response.read()
logger.info(f"下载完成,文件大小: {len(excel_content)} 字节")
# 2. 解析 Excel
return self._parse_excel_content(excel_content)
except Exception as e:
logger.error(f"fetch_and_parse 失败: {str(e)}")
raise
def _parse_excel_content(self, excel_content: bytes) -> List[Dict[str, Any]]:
"""
解析 Excel 内容
Args:
excel_content: Excel 文件的二进制内容
Returns:
List[Dict]: 解析后的数据
"""
try:
# 使用 BytesIO 从内存中读取
excel_file = BytesIO(excel_content)
workbook = load_workbook(excel_file)
# 获取第一个工作表
worksheet = workbook.active
if not worksheet:
raise Exception("Excel 文件不包含有效的工作表")
# 获取标题行
headers = []
for cell in worksheet[1]:
headers.append(cell.value)
if not headers or all(h is None for h in headers):
raise Exception("Excel 文件不包含有效的标题行")
# 解析数据行
data = []
for row in worksheet.iter_rows(min_row=2, values_only=False):
row_data = {}
for idx, cell in enumerate(row):
if idx < len(headers):
row_data[headers[idx]] = cell.value
# 跳过空行
if any(v is not None for v in row_data.values()):
data.append(row_data)
logger.info(f"成功解析 Excel,共 {len(data)} 行数据")
return data
except Exception as e:
logger.error(f"_parse_excel_content 失败: {str(e)}")
raise
"""
进度管理模块
负责任务进度的记录和查询
"""
import logging
from typing import Dict, Any, Optional
from datetime import datetime, timedelta
import threading
logger = logging.getLogger(__name__)
class ProgressManager:
"""进度管理类"""
def __init__(self, timeout_seconds: int = 3600):
"""
初始化进度管理器
Args:
timeout_seconds: 任务进度的过期时间(秒),默认 1 小时
"""
self.progress_data: Dict[str, Dict[str, Any]] = {}
self.timeout_seconds = timeout_seconds
self.lock = threading.Lock()
def update_progress(
self,
task_id: str,
status: str,
progress: int,
message: str,
processed_count: Optional[int] = None
) -> None:
"""
更新任务进度
Args:
task_id: 任务唯一标识
status: 状态 (queued, processing, completed, failed)
progress: 进度百分比 (0-100)
message: 进度信息
processed_count: 已处理的数据条数,None 表示暂未统计
"""
with self.lock:
self.progress_data[task_id] = {
'task_id': task_id,
'status': status,
'progress': max(0, min(100, progress)),
'message': message,
'processed_count': processed_count,
'timestamp': datetime.now().isoformat(),
'created_at': datetime.now()
}
logger.debug(f"[{task_id}] 进度更新: {status} {progress}% - {message}")
def get_progress(self, task_id: str) -> Optional[Dict[str, Any]]:
"""
获取任务进度
Args:
task_id: 任务唯一标识
Returns:
Optional[Dict]: 进度信息,若任务不存在或已过期返回 None
"""
with self.lock:
if task_id not in self.progress_data:
return None
data = self.progress_data[task_id]
# 检查是否过期
if datetime.now() - data['created_at'] > timedelta(seconds=self.timeout_seconds):
logger.warning(f"任务 {task_id} 已过期,删除记录")
del self.progress_data[task_id]
return None
# 返回字典副本,移除 created_at(内部字段)
result = {k: v for k, v in data.items() if k != 'created_at'}
return result
def get_all_progress(self) -> Dict[str, Dict[str, Any]]:
"""
获取所有任务的进度信息
Returns:
Dict: 所有任务的进度信息
"""
with self.lock:
# 清理过期任务
expired_tasks = []
for task_id, data in self.progress_data.items():
if datetime.now() - data['created_at'] > timedelta(seconds=self.timeout_seconds):
expired_tasks.append(task_id)
for task_id in expired_tasks:
del self.progress_data[task_id]
logger.info(f"清理过期任务: {task_id}")
# 返回所有有效任务的进度
return {
task_id: {k: v for k, v in data.items() if k != 'created_at'}
for task_id, data in self.progress_data.items()
}
def clear_progress(self, task_id: str) -> None:
"""
清除任务进度记录
Args:
task_id: 任务唯一标识
"""
with self.lock:
if task_id in self.progress_data:
del self.progress_data[task_id]
logger.info(f"清除任务 {task_id} 的进度记录")
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta

# File paths
# TODO: make the audit month configurable (defaults to the 1st of last month)
current_date = (datetime.now().replace(day=1) - relativedelta(months=1)).strftime("%Y-%m-01")
y_file = f"/王小卤/风控/代码-新/大日期{current_date}_2.xlsx"
p_file = f"/王小卤/风控/代码-新//线下价盘表2601版.xlsx"
# Output path (writing to a new file is safer than overwriting the source)
output_file = f"/王小卤/风控/代码-新//低价大日期_2.xlsx"

# Read table Y (audit results); load as strings first to avoid format issues, convert later
df_y = pd.read_excel(y_file, sheet_name='合并后', dtype=str)
# Read table P (price list)
df_p = pd.read_excel(p_file, dtype=str)

# Clean column names (strip surrounding whitespace)
df_y.columns = df_y.columns.str.strip()
df_p.columns = df_p.columns.str.strip()

# Normalize key fields (strip whitespace, uppercase) so they match reliably
def clean_str(s):
    if pd.isna(s):
        return ""
    return str(s).strip().upper()

# Clean the key columns of table Y
df_y['产品系列_clean'] = df_y.iloc[:, 14].apply(clean_str)   # column O: product series
df_y['产品克重_clean'] = df_y.iloc[:, 16].apply(clean_str)   # column Q: product weight
df_y['渠道类型_clean'] = df_y.iloc[:, 13].apply(clean_str)   # column N: channel type (from audit source)

# Clean the key columns of table P
df_p['产品系统_clean'] = df_p.iloc[:, 0].apply(clean_str)    # column A: product system
df_p['产品克重_p_clean'] = df_p.iloc[:, 2].apply(clean_str)  # column C: product weight
df_p['渠道_p_clean'] = df_p.iloc[:, 3].apply(clean_str)      # column D: channel

# Convert price columns to numeric (non-numeric values become NaN)
df_y['产品价格_num'] = pd.to_numeric(df_y.iloc[:, 17], errors='coerce')  # column R: product price
df_p['低价_num'] = pd.to_numeric(df_p.iloc[:, 4], errors='coerce')       # column E: floor price

# Build table P's unique key (product system + weight + channel)
df_p['match_key'] = df_p['产品系统_clean'] + '|' + df_p['产品克重_p_clean'] + '|' + df_p['渠道_p_clean']
# Build table Y's matching key (product series + weight + channel type)
df_y['match_key'] = df_y['产品系列_clean'] + '|' + df_y['产品克重_clean'] + '|' + df_y['渠道类型_clean']

# Turn table P into a lookup dict: key -> floor price
price_map = df_p.set_index('match_key')['低价_num'].to_dict()

# Initialize target columns in table Y (S: below-floor flag, T: price gap)
df_y['是否低价'] = '正常'  # default
df_y['破价价差'] = None

# Match and judge each row of table Y
for idx, row in df_y.iterrows():
    key = row['match_key']
    y_price = row['产品价格_num']
    p_low_price = price_map.get(key, None)
    if pd.notna(y_price) and pd.notna(p_low_price):
        if y_price < p_low_price:
            df_y.at[idx, '是否低价'] = '低价'
            df_y.at[idx, '破价价差'] = round(p_low_price - y_price, 2)
            df_y.at[idx, '低价整改状态'] = '未整改'
        else:
            df_y.at[idx, '是否低价'] = '正常'
            df_y.at[idx, '破价价差'] = None
    else:
        # No match or missing price: mark as unknown
        df_y.at[idx, '是否低价'] = None
        df_y.at[idx, '破价价差'] = None

# Keep only the original columns (drop the helper columns)
original_columns = df_y.columns.tolist()
output_columns = [col for col in original_columns
                  if not col.endswith('_clean') and col not in ['产品价格_num', 'match_key']]
df_y[output_columns].to_excel(output_file, index=False)
print(f"处理完成!结果已保存至:{output_file}")
import pandas as pd
import copy
import os
from datetime import datetime
from dateutil.relativedelta import relativedelta
# === 本地独立运行配置(仅 __main__ 模式使用)===
source_file = "/王小卤/风控/代码-新//2026.2-团队数据源.xlsx"
def _get_default_audit_date() -> str:
"""返回上月1号作为默认稽查日期,格式 yyyy-mm-01"""
return (datetime.now().replace(day=1) - relativedelta(months=1)).strftime("%Y-%m-01")
# 列映射(目标表列名)
COLUMN_MAPPING = {
"稽查日期": "稽查日期",
"稽查来源": "稽查来源",
"勤策门店编码": "勤策门店编码",
"勤策门店名称": "勤策门店名称",
"经销商名称": "经销商名称",
"城市": "城市",
"渠道类型": "渠道类型(稽查源提供)",
"产品系列": "产品系列",
"产品口味": "产品口味",
"产品克重": "产品克重",
"产品价格": "产品价格",
"产品生产月份": "产品生产月份",
}
# ===== 新增:多产品组配置 =====
# 每组:价格列 + 7个口味列 + 产品信息
# 团队表
PRODUCT_GROUPS_JC = [
# 第1组:虎皮凤爪 210g
{
"price_col": 50,
"flavor_cols": [51, 52, 53, 54, 55, 56, 57],
"series": "虎皮凤爪",
"weight": "210g",
"flavors": ["卤香", "香辣", "椒麻", "火锅", "微辣", "麻辣", "黑鸭"]
},
# 第2组:虎皮凤爪 105g
{
"price_col": 58,
"flavor_cols": [59, 60, 61, 62, 63, 64, 65],
"series": "虎皮凤爪",
"weight": "105g",
"flavors": ["卤香", "香辣", "椒麻", "火锅", "微辣", "麻辣", "黑鸭"]
},
# 第3组:虎皮凤爪 68g
{
"price_col": 66,
"flavor_cols": [67, 68, 69, 70, 71],
"series": "虎皮凤爪",
"weight": "68g",
"flavors": ["卤香", "香辣", "椒麻", "麻辣", "黑鸭"]
},
# 第4组:鸡肉豆堡 120g
{
"price_col": 72,
"flavor_cols": [73, 74],
"series": "鸡肉豆堡",
"weight": "120g",
"flavors": ["卤香", "香辣"]
},
# 第5组:牛肉豆堡 120g
{
"price_col": 75,
"flavor_cols": [76, 77],
"series": "牛肉豆堡",
"weight": "120g",
"flavors": ["卤香", "香辣"]
},
# 第6组:去骨凤爪 72g
{
"price_col": 78,
"flavor_cols": [79, 80],
"series": "去骨凤爪",
"weight": "72g",
"flavors": ["柠檬", "香辣"]
},
# 第7组:去骨凤爪 138g
{
"price_col": 81,
"flavor_cols": [82, 83],
"series": "去骨凤爪",
"weight": "138g",
"flavors": ["柠檬", "香辣"]
},
# 第8组:虎皮小鸡腿 80g
{
"price_col": 84,
"flavor_cols": [85, 86],
"series": "虎皮小鸡腿",
"weight": "80g",
"flavors": ["卤香", "香辣"]
},
# 第9组:老卤凤爪 95g(与老卤鸭掌共用 price_col=87)
{
"price_col": 87,
"flavor_cols": [88],
"series": "老卤凤爪",
"weight": "95g",
"flavors": ["卤香"]
},
# 第10组:老卤鸭掌 95g(与老卤凤爪共用 price_col=87)
{
"price_col": 87,
"flavor_cols": [89],
"series": "老卤鸭掌",
"weight": "95g",
"flavors": ["卤香"]
},
# 第11组:虎皮凤爪 25g
{
"price_col": 90,
"flavor_cols": [91, 92],
"series": "虎皮凤爪",
"weight": "25g",
"flavors": ["卤香", "香辣"]
},
# 第12组:虎皮凤爪 散称
{
"price_col": 93,
"flavor_cols": [94, 95, 96],
"series": "虎皮凤爪",
"weight": "散称",
"flavors": ["卤香", "香辣", "黑鸭"]
}
]
# 标准输出列定义(与目标表结构保持一致)
STANDARD_COLUMNS = [
"稽查日期", "稽查来源", "大区", "战区", "经销商编码", "经销商名称",
"勤策门店编码", "勤策门店名称", "客户经理工号", "客户经理",
"勤策渠道大类", "稽核渠道(对N列清洗)", "城市", "渠道类型(稽查源提供)",
"产品系列", "产品口味", "产品克重", "产品价格", "是否低价", "破价价差", "低价整改状态",
"低价整改说明", "产品生产月份", "临期月份数", "临期状态", "新鲜度",
"大日期整改状态", "大日期整改说明"
]
def _build_records(df_source, yname, pg, existing_columns, audit_date: str = None):
"""
核心记录构建逻辑,供 transform() 和 main() 复用。
Args:
df_source: pandas DataFrame,列通过 iloc 按位置访问
yname: 稽查来源名称,如 '稽查团队'
pg: 产品组配置列表
existing_columns: 目标表的列名列表
audit_date: 稽查日期字符串,格式 'yyyy-mm-dd';为 None 时取上月1号
Returns:
list: 构建好的记录列表(每条为 dict)
"""
if audit_date is None:
audit_date = _get_default_audit_date()
records = []
for idx, row in df_source.iterrows():
base_data = {
"勤策门店编码": str(row.iloc[8]).strip() if pd.notna(row.iloc[8]) else "",
"城市": str(row.iloc[4]).strip() if pd.notna(row.iloc[4]) else "",
"勤策门店名称": str(row.iloc[9]).strip() if pd.notna(row.iloc[9]) else "",
"经销商名称": str(row.iloc[7]).strip() if pd.notna(row.iloc[7]) else "",
"渠道类型": str(row.iloc[10]).strip() if pd.notna(row.iloc[10]) else "",
}
base_row = {}
if COLUMN_MAPPING["稽查日期"] in existing_columns:
base_row[COLUMN_MAPPING["稽查日期"]] = audit_date
if COLUMN_MAPPING["稽查来源"] in existing_columns:
base_row[COLUMN_MAPPING["稽查来源"]] = yname
if COLUMN_MAPPING["勤策门店编码"] in existing_columns:
base_row[COLUMN_MAPPING["勤策门店编码"]] = base_data["勤策门店编码"]
if COLUMN_MAPPING["勤策门店名称"] in existing_columns:
base_row[COLUMN_MAPPING["勤策门店名称"]] = base_data["勤策门店名称"]
if COLUMN_MAPPING["经销商名称"] in existing_columns:
base_row[COLUMN_MAPPING["经销商名称"]] = base_data["经销商名称"]
if COLUMN_MAPPING["城市"] in existing_columns:
base_row[COLUMN_MAPPING["城市"]] = base_data["城市"]
if COLUMN_MAPPING["渠道类型"] in existing_columns:
base_row[COLUMN_MAPPING["渠道类型"]] = base_data["渠道类型"]
for group in pg:
price_col = group["price_col"]
flavor_cols = group["flavor_cols"]
flavors = group["flavors"]
series = group["series"]
weight = group["weight"]
src_price = str(row.iloc[price_col]).strip() if pd.notna(row.iloc[price_col]) else ""
if not src_price or src_price == '无价签':
src_price = ''
row_with_price = copy.deepcopy(base_row)
if COLUMN_MAPPING["产品价格"] in existing_columns:
row_with_price[COLUMN_MAPPING["产品价格"]] = src_price
for i, col_idx in enumerate(flavor_cols):
flavor_name = flavors[i]
src_month = str(row.iloc[col_idx]).strip() if pd.notna(row.iloc[col_idx]) else ""
if src_month:
new_rec = copy.deepcopy(row_with_price)
src_month = normalize_month(src_month)
_set_product_fields(new_rec, series, flavor_name, weight, src_month, existing_columns)
rDate(new_rec)
records.append(new_rec)
elif src_price:
new_rec = copy.deepcopy(row_with_price)
_set_product_fields(new_rec, series, flavor_name, weight, None, existing_columns)
rDate(new_rec)
records.append(new_rec)
return records
def transform(df_source, yname, pg, audit_date: str = None):
"""
供 API 调用的数据转换入口:接收 DataFrame,返回清洗后的记录列表,不读写任何文件。
Args:
df_source: pandas DataFrame,列通过 iloc 按位置访问(与原始 Excel 列顺序对应)
yname: 稽查来源名称,如 '稽查团队'
pg: 产品组配置列表
audit_date: 稽查日期字符串,格式 'yyyy-mm-dd';为 None 时自动取上月1号
Returns:
list[dict]: 按 STANDARD_COLUMNS 结构整理好的记录列表
"""
return _build_records(df_source, yname, pg, STANDARD_COLUMNS, audit_date=audit_date)
# === 主逻辑(独立运行/本地文件模式) ===
def main(df_source, yname, pg, audit_date: str = None):
if audit_date is None:
audit_date = _get_default_audit_date()
target_file = f"/王小卤/风控/代码-新/大日期{audit_date}_2.xlsx"
try:
# 获取目标表结构
try:
df_target = pd.read_excel(target_file, sheet_name="合并后", dtype=str)
existing_columns = df_target.columns.tolist()
except (FileNotFoundError, ValueError):
df_target = pd.DataFrame(columns=STANDARD_COLUMNS)
existing_columns = STANDARD_COLUMNS
records = _build_records(df_source, yname, pg, existing_columns, audit_date=audit_date)
if not records:
print("⚠️ 无有效数据需要追加。")
return
df_new = pd.DataFrame(records, columns=existing_columns)
df_combined = pd.concat([df_target, df_new], ignore_index=True)
if os.path.exists(target_file):
with pd.ExcelWriter(target_file, engine='openpyxl', mode='a', if_sheet_exists='replace') as writer:
df_combined.to_excel(writer, sheet_name="合并后", index=False)
else:
with pd.ExcelWriter(target_file, engine='openpyxl', mode='w') as writer:
df_combined.to_excel(writer, sheet_name="合并后", index=False)
print(f"✅ 成功追加 {len(records)} 条记录到目标表!")
except Exception as e:
print(f"❌ 错误: {e}")
import traceback
traceback.print_exc()
def _set_product_fields(record, series, flavor, weight, prod_month_str, existing_columns):
"""设置产品字段"""
if COLUMN_MAPPING["产品系列"] in existing_columns:
record[COLUMN_MAPPING["产品系列"]] = series
if COLUMN_MAPPING["产品口味"] in existing_columns:
record[COLUMN_MAPPING["产品口味"]] = flavor
if COLUMN_MAPPING["产品克重"] in existing_columns:
record[COLUMN_MAPPING["产品克重"]] = weight
if prod_month_str and COLUMN_MAPPING["产品生产月份"] in existing_columns:
try:
dt = datetime.strptime(prod_month_str, "%Y-%m-%d")
record[COLUMN_MAPPING["产品生产月份"]] = dt.strftime("%Y-%m-%d")
except (ValueError, TypeError):
record[COLUMN_MAPPING["产品生产月份"]] = None
def rDate(row_dict):
"""计算临期状态(保持你原有的业务逻辑)"""
prod_date_str = row_dict.get("产品生产月份", None)
inspect_date_str = row_dict.get("稽查日期", "").strip()
if not prod_date_str or not inspect_date_str:
row_dict["临期状态"] = ""
row_dict["新鲜度"] = ""
row_dict["临期月份数"] = ""
return
try:
prod_date = datetime.strptime(prod_date_str, "%Y-%m-%d")
inspect_date = datetime.strptime(inspect_date_str, "%Y-%m-%d")
except ValueError:
row_dict["临期状态"] = ""
row_dict["新鲜度"] = ""
row_dict["临期月份数"] = ""
return
product_series = row_dict.get("产品系列", "")
zg_status = "未整改"
if product_series == "去骨凤爪":
expiry_date = prod_date + relativedelta(months=6)
gap_months = _calculate_gap_months(expiry_date, inspect_date)
if gap_months >= 2:
status, freshness,zg_status = "非大日期", "高",""
elif 1 <= gap_months < 2:
status, freshness = "大日期", "低"
elif 0 <= gap_months < 1:
status, freshness = "临期", "低"
else:
status, freshness = "过期", "低"
else:
expiry_date = prod_date + relativedelta(months=9)
gap_months = _calculate_gap_months(expiry_date, inspect_date)
if gap_months >= 3:
status, freshness,zg_status = "非大日期", "高",""
elif 1 <= gap_months < 3:
status, freshness = "大日期", "低"
elif 0 <= gap_months < 1:
status, freshness = "临期", "低"
else:
status, freshness = "过期", "低"
row_dict["临期状态"] = status
row_dict["新鲜度"] = freshness
row_dict["临期月份数"] = round(gap_months, 2)
row_dict["大日期整改状态"] = zg_status
def _calculate_gap_months(expiry_date, inspect_date):
diff_years = expiry_date.year - inspect_date.year
diff_months = expiry_date.month - inspect_date.month
diff_days = expiry_date.day - inspect_date.day
return diff_years * 12 + diff_months + diff_days / 30.0
import re
# TODO: 这里还需要修改
def normalize_month(src_month):
"""
将生产月份字符串标准化为 'yyyy-mm-01' 格式。
支持的输入格式:
- 'yyyy-mm' 或 'yyyy-m'(如 '2025-12'、'2025-1')→ '2025-12-01'、'2025-01-01'
- 'yyyymm'(如 '202512')→ '2025-12-01'
其他格式或无效值原样返回
"""
if not isinstance(src_month, str):
return src_month # 非字符串直接返回
src_month = src_month.strip()
if not src_month:
return src_month
# 情况1: 已是 yyyy-mm 格式(例如 2025-12)
if re.fullmatch(r'\d{4}-\d{1,2}', src_month):
# 可选:统一补零为两位月(如 2025-1 → 2025-01)
year, month = src_month.split('-')
month = month.zfill(2) # 确保月份两位
return f"{year}-{month}-01"
# 情况2: 是 yyyymm 格式(6位数字,如 202512)
if re.fullmatch(r'\d{6}', src_month):
year = src_month[:4]
month = src_month[4:].zfill(2)  # 取后两位并补零,确保月份两位
return f"{year}-{month}-01"
# 其他格式:不处理(或可根据需求返回空)
return src_month
if __name__ == "__main__":
# TODO: 配置 sheet 页名称
print("正在读取【团队】源文件(跳过第 1 行标题,第 2 行作为数据第 1 行)...")
# 说明:
# 1. skiprows=1 : 跳过物理第 1 行(标题)
# 2. header=None : 关键!不把物理第 2 行当表头,而是当数据;
#    物理第 2 行因此成为 df 的第 0 行,列名自动变为 0, 1, 2...
#    这与代码中 row.iloc[4]、row.iloc[8] 等按位置访问的逻辑一致。
df_source_p = pd.read_excel(source_file, skiprows=1, header=None, dtype=str)
# 验证读取结果(可选,用于调试)
print(f"✅ 成功读取 {len(df_source_p)} 行数据。")
if len(df_source_p) > 0:
print("前 2 行数据预览(确认第 2 行是否在列):")
print(df_source_p.head(2))
print(f"列索引范围:0 到 {len(df_source_p.columns) - 1}")
main(df_source_p, '稽查团队', PRODUCT_GROUPS_JC)
import pandas as pd
import copy
import os
from datetime import datetime
from dateutil.relativedelta import relativedelta
# TODO: === 配置区 ===
# TODO: 配置稽查月份(默认1号)0:1,1:12,2:11,3:10
current_date = (datetime.now().replace(day=1) - relativedelta(months=1)).strftime("%Y-%m-01")
source_file = "/王小卤/风控/代码-新/2026.2-浦零数据源.xlsx"
#source_file_cy = "/Users/a02200059/Desktop/王小卤/风控中心/低价+大日期/2512门店稽查结果/诚予国际.xlsx"
target_file = f"/王小卤/风控/代码-新/大日期{current_date}_2.xlsx"
# 列映射(目标表列名)
COLUMN_MAPPING = {
"稽查日期": "稽查日期",
"稽查来源": "稽查来源",
"勤策门店编码": "勤策门店编码",
"勤策门店名称": "勤策门店名称",
"经销商名称": "经销商名称",
"城市": "城市",
"渠道类型": "渠道类型(稽查源提供)",
"产品系列": "产品系列",
"产品口味": "产品口味",
"产品克重": "产品克重",
"产品价格": "产品价格",
"产品生产月份": "产品生产月份",
}
# ===== 新增:多产品组配置 =====
# 每组:价格列 + 若干口味列 + 产品信息(各组口味列数量不同)
# 诚予国际
PRODUCT_GROUPS_CY = [
# 第1组:虎皮凤爪 210g
{
"price_col": 7,
"flavor_cols": [8, 9, 10, 11, 12, 13, 14],
"series": "虎皮凤爪",
"weight": "210g",
"flavors": ["卤香", "香辣", "椒麻", "火锅", "微辣", "麻辣", "黑鸭"]
},
# 第2组:虎皮凤爪 105g
{
"price_col": 15,
"flavor_cols": [16, 17, 18, 19, 20, 21, 22],
"series": "虎皮凤爪",
"weight": "105g",
"flavors": ["卤香", "香辣", "椒麻", "火锅", "微辣", "麻辣", "黑鸭"]
},
# 第3组:虎皮凤爪 68g
{
"price_col": 23,
"flavor_cols": [24, 25, 26, 27, 28],
"series": "虎皮凤爪",
"weight": "68g",
"flavors": ["卤香", "香辣", "椒麻", "麻辣", "黑鸭"]
},
# 第4组:鸡肉豆堡 120g
{
"price_col": 29,
"flavor_cols": [30, 31],
"series": "鸡肉豆堡",
"weight": "120g",
"flavors": ["卤香", "香辣"]
},
# 第5组:牛肉豆堡 120g
{
"price_col": 32,
"flavor_cols": [33, 34],
"series": "牛肉豆堡",
"weight": "120g",
"flavors": ["卤香", "香辣"]
},
# 第6组:去骨凤爪 72g
{
"price_col": 35,
"flavor_cols": [36, 37],
"series": "去骨凤爪",
"weight": "72g",
"flavors": ["柠檬", "香辣"]
},
# 第7组:去骨凤爪 138g
{
"price_col": 38,
"flavor_cols": [39, 40],
"series": "去骨凤爪",
"weight": "138g",
"flavors": ["柠檬", "香辣"]
},
# 第8组:虎皮小鸡腿 80g
{
"price_col": 41,
"flavor_cols": [42, 43],
"series": "虎皮小鸡腿",
"weight": "80g",
"flavors": ["卤香", "香辣"]
},
# 第9组:老卤凤爪 95g(与老卤鸭掌共用 price_col=44)
{
"price_col": 44,
"flavor_cols": [45],
"series": "老卤凤爪",
"weight": "95g",
"flavors": ["卤香"]
},
# 第10组:老卤鸭掌 95g(与老卤凤爪共用 price_col=44)
{
"price_col": 44,
"flavor_cols": [46],
"series": "老卤鸭掌",
"weight": "95g",
"flavors": ["卤香"]
},
# 第11组:虎皮凤爪 25g
{
"price_col": 47,
"flavor_cols": [48, 49],
"series": "虎皮凤爪",
"weight": "25g",
"flavors": ["卤香", "香辣"]
},
# 第12组:虎皮凤爪 散称
{
"price_col": 50,
"flavor_cols": [51, 52, 53],
"series": "虎皮凤爪",
"weight": "散称",
"flavors": ["卤香", "香辣", "黑鸭"]
}
]
# 标准输出列定义(与目标表结构保持一致)
STANDARD_COLUMNS = [
"稽查日期", "稽查来源", "大区", "战区", "经销商编码", "经销商名称",
"勤策门店编码", "勤策门店名称", "客户经理工号", "客户经理",
"勤策渠道大类", "稽核渠道(对N列清洗)", "城市", "渠道类型(稽查源提供)",
"产品系列", "产品口味", "产品克重", "产品价格", "是否低价", "破价价差", "低价整改状态",
"低价整改说明", "产品生产月份", "临期月份数", "临期状态", "新鲜度",
"大日期整改状态", "大日期整改说明"
]
PRODUCT_GROUPS = [
# 第1组:虎皮凤爪 210g
{
"price_col": 6,
"flavor_cols": [7, 8, 9, 10, 11, 12, 13],
"series": "虎皮凤爪",
"weight": "210g",
"flavors": ["卤香", "香辣", "椒麻", "火锅", "微辣", "麻辣", "黑鸭"]
},
# 第2组:虎皮凤爪 105g
{
"price_col": 14,
"flavor_cols": [15, 16, 17, 18, 19, 20, 21],
"series": "虎皮凤爪",
"weight": "105g",
"flavors": ["卤香", "香辣", "椒麻", "火锅", "微辣", "麻辣", "黑鸭"]
},
# 第3组:虎皮凤爪 68g
{
"price_col": 22,
"flavor_cols": [23, 24, 25, 26, 27],
"series": "虎皮凤爪",
"weight": "68g",
"flavors": ["卤香", "香辣", "椒麻", "麻辣", "黑鸭"]
},
# 第4组:鸡肉豆堡 120g
{
"price_col": 28,
"flavor_cols": [29, 30],
"series": "鸡肉豆堡",
"weight": "120g",
"flavors": ["卤香", "香辣"]
},
# 第5组:牛肉豆堡 120g
{
"price_col": 31,
"flavor_cols": [32, 33],
"series": "牛肉豆堡",
"weight": "120g",
"flavors": ["卤香", "香辣"]
},
# 第6组:去骨凤爪 72g
{
"price_col": 34,
"flavor_cols": [35, 36],
"series": "去骨凤爪",
"weight": "72g",
"flavors": ["柠檬", "香辣"]
},
# 第7组:去骨凤爪 138g
{
"price_col": 37,
"flavor_cols": [38, 39],
"series": "去骨凤爪",
"weight": "138g",
"flavors": ["柠檬", "香辣"]
},
# 第8组:虎皮小鸡腿 80g
{
"price_col": 40,
"flavor_cols": [41, 42],
"series": "虎皮小鸡腿",
"weight": "80g",
"flavors": ["卤香", "香辣"]
},
# 第9组:老卤凤爪 95g
{
"price_col": 43,
"flavor_cols": [44],
"series": "老卤凤爪",
"weight": "95g",
"flavors": ["卤香"]
},
# 第10组:老卤鸭掌 95g
{
"price_col": 45,
"flavor_cols": [46],
"series": "老卤鸭掌",
"weight": "95g",
"flavors": ["卤香"]
},
# 第11组:虎皮凤爪 25g
{
"price_col": 47,
"flavor_cols": [48, 49],
"series": "虎皮凤爪",
"weight": "25g",
"flavors": ["卤香", "香辣"]
},
# 第12组:虎皮凤爪 散称
{
"price_col": 50,
"flavor_cols": [51, 52, 53],
"series": "虎皮凤爪",
"weight": "散称",
"flavors": ["卤香", "香辣", "黑鸭"]
}
]
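PRODUCT_GROUPS 这类配置的含义可以用一个独立小例子说明:price_col 指向该组的价格列,flavor_cols 与 flavors 按位置一一对应;有生产月份必出记录,无月份但有价格也出记录(对应下方 main 中的两种情况)。以下为示意,行数据为虚构:

```python
# 示意:单个产品组配置如何把一行宽表数据展开为多条记录(数据为虚构)
row = [""] * 54
row[6] = "9.9"        # 该组价格列
row[7] = "202512"     # "卤香"口味的生产月份
row[9] = "2025-11"    # "椒麻"口味的生产月份

group = {
    "price_col": 6,
    "flavor_cols": [7, 8, 9, 10, 11, 12, 13],
    "series": "虎皮凤爪",
    "weight": "210g",
    "flavors": ["卤香", "香辣", "椒麻", "火锅", "微辣", "麻辣", "黑鸭"],
}

records = []
price = row[group["price_col"]].strip()
for i, col_idx in enumerate(group["flavor_cols"]):
    month = row[col_idx].strip()
    if month:            # 情况1: 有生产月份 → 必须生成记录
        records.append((group["series"], group["weight"], group["flavors"][i], month))
    elif price:          # 情况2: 无月份但有价格 → 生成记录(月份留空)
        records.append((group["series"], group["weight"], group["flavors"][i], ""))

print(len(records))  # 7:有价格时,该组 7 个口味各生成一条记录
```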
# === 主逻辑 ===
def main(df_source, yname, pg):
try:
# 获取目标表结构
try:
df_target = pd.read_excel(target_file, sheet_name="合并后", dtype=str)
existing_columns = df_target.columns.tolist()
except (FileNotFoundError, ValueError):
df_target = pd.DataFrame(columns=STANDARD_COLUMNS)
existing_columns = STANDARD_COLUMNS
records = []
# 处理每一行
for idx, row in df_source.iterrows():
# 提取基础字段(B~F)
base_data = {
"勤策门店编码": str(row.iloc[1]).strip() if pd.notna(row.iloc[1]) else "",
"城市": str(row.iloc[2]).strip() if pd.notna(row.iloc[2]) else "",
"勤策门店名称": str(row.iloc[3]).strip() if pd.notna(row.iloc[3]) else "",
"经销商名称": str(row.iloc[4]).strip() if pd.notna(row.iloc[4]) else "",
"渠道类型": str(row.iloc[5]).strip() if pd.notna(row.iloc[5]) else "",
}
# 构建基础行(不含产品信息)
base_row = {}
if COLUMN_MAPPING["稽查日期"] in existing_columns:
base_row[COLUMN_MAPPING["稽查日期"]] = current_date
if COLUMN_MAPPING["稽查来源"] in existing_columns:
base_row[COLUMN_MAPPING["稽查来源"]] = yname
if COLUMN_MAPPING["勤策门店编码"] in existing_columns:
base_row[COLUMN_MAPPING["勤策门店编码"]] = base_data["勤策门店编码"]
if COLUMN_MAPPING["勤策门店名称"] in existing_columns:
base_row[COLUMN_MAPPING["勤策门店名称"]] = base_data["勤策门店名称"]
if COLUMN_MAPPING["经销商名称"] in existing_columns:
base_row[COLUMN_MAPPING["经销商名称"]] = base_data["经销商名称"]
if COLUMN_MAPPING["城市"] in existing_columns:
base_row[COLUMN_MAPPING["城市"]] = base_data["城市"]
if COLUMN_MAPPING["渠道类型"] in existing_columns:
base_row[COLUMN_MAPPING["渠道类型"]] = base_data["渠道类型"]
# 处理每一组产品
for group in pg:
price_col = group["price_col"]
flavor_cols = group["flavor_cols"]
flavors = group["flavors"]
series = group["series"]
weight = group["weight"]
if not flavor_cols:
print("⚠️ 未找到任何口味列!")
continue  # 该组无口味列,跳过
# 获取该组价格
src_price = str(row.iloc[price_col]).strip() if pd.notna(row.iloc[price_col]) else ""
if not src_price or src_price == '无价签':
src_price = ''
# 设置价格到基础行副本(仅用于本组)
row_with_price = copy.deepcopy(base_row)
if COLUMN_MAPPING["产品价格"] in existing_columns:
row_with_price[COLUMN_MAPPING["产品价格"]] = src_price
# 处理该组的各个口味(数量随组而异)
for i, col_idx in enumerate(flavor_cols):
flavor_name = flavors[i]
src_month = str(row.iloc[col_idx]).strip() if pd.notna(row.iloc[col_idx]) else ""
# 情况1: 有生产月份 → 必须生成记录
if src_month:
new_rec = copy.deepcopy(row_with_price)
# 修改src_month格式
src_month = normalize_month(src_month)
_set_product_fields(new_rec, series, flavor_name, weight, src_month, existing_columns)
rDate(new_rec)
records.append(new_rec)
# 情况2: 无生产月份但有价格 → 生成记录(生产月份留空)
elif src_price:
new_rec = copy.deepcopy(row_with_price)
_set_product_fields(new_rec, series, flavor_name, weight, None, existing_columns)
rDate(new_rec)
records.append(new_rec)
if not records:
print("⚠️ 无有效数据需要追加。")
return
df_new = pd.DataFrame(records, columns=existing_columns)
df_combined = pd.concat([df_target, df_new], ignore_index=True)
# 判断目标文件是否存在
if os.path.exists(target_file):
# 文件存在:以追加模式打开,替换 "合并后" sheet
with pd.ExcelWriter(target_file, engine='openpyxl', mode='a', if_sheet_exists='replace') as writer:
df_combined.to_excel(writer, sheet_name="合并后", index=False)
else:
# 文件不存在:创建新文件,只写入 "合并后" sheet
with pd.ExcelWriter(target_file, engine='openpyxl', mode='w') as writer:
df_combined.to_excel(writer, sheet_name="合并后", index=False)
print(f"✅ 成功追加 {len(records)} 条记录到目标表!")
except Exception as e:
print(f"❌ 错误: {e}")
import traceback
traceback.print_exc()
def _set_product_fields(record, series, flavor, weight, prod_month_str, existing_columns):
"""设置产品字段"""
if COLUMN_MAPPING["产品系列"] in existing_columns:
record[COLUMN_MAPPING["产品系列"]] = series
if COLUMN_MAPPING["产品口味"] in existing_columns:
record[COLUMN_MAPPING["产品口味"]] = flavor
if COLUMN_MAPPING["产品克重"] in existing_columns:
record[COLUMN_MAPPING["产品克重"]] = weight
if prod_month_str and COLUMN_MAPPING["产品生产月份"] in existing_columns:
try:
# prod_month_str 形如 "yyyy-mm-dd"(normalize_month 的输出)
dt = datetime.strptime(prod_month_str, "%Y-%m-%d")
record[COLUMN_MAPPING["产品生产月份"]] = dt.date()  # 转为 date,便于后续日期运算
except (ValueError, TypeError):
# 解析失败时置空
record[COLUMN_MAPPING["产品生产月份"]] = None
def rDate(row_dict):
"""计算临期状态、新鲜度与临期月份数(保持原有业务逻辑)"""
prod_date = row_dict.get("产品生产月份", None)  # 此处已是 date 对象(由 _set_product_fields 写入)
inspect_date_str = row_dict.get("稽查日期", "").strip()
if not prod_date or not inspect_date_str:
row_dict["临期状态"] = ""
row_dict["新鲜度"] = ""
row_dict["临期月份数"] = ""
return
try:
inspect_date = datetime.strptime(inspect_date_str, "%Y-%m-%d")
except ValueError:
row_dict["临期状态"] = ""
row_dict["新鲜度"] = ""
row_dict["临期月份数"] = ""
return
product_series = row_dict.get("产品系列", "")
zg_status = "未整改"
if product_series == "去骨凤爪":
expiry_date = prod_date + relativedelta(months=6)
gap_months = _calculate_gap_months(expiry_date, inspect_date)
if gap_months >= 2:
status, freshness, zg_status = "非大日期", "高", ""
elif 1 <= gap_months < 2:
status, freshness = "大日期", "低"
elif 0 <= gap_months < 1:
status, freshness = "临期", "低"
else:
status, freshness = "过期", "低"
else:
expiry_date = prod_date + relativedelta(months=9)
gap_months = _calculate_gap_months(expiry_date, inspect_date)
if gap_months >= 3:
status, freshness, zg_status = "非大日期", "高", ""
elif 1 <= gap_months < 3:
status, freshness = "大日期", "低"
elif 0 <= gap_months < 1:
status, freshness = "临期", "低"
else:
status, freshness = "过期", "低"
row_dict["临期状态"] = status
row_dict["新鲜度"] = freshness
row_dict["临期月份数"] = round(gap_months, 2)
row_dict["大日期整改状态"] = zg_status
def _calculate_gap_months(expiry_date, inspect_date):
diff_years = expiry_date.year - inspect_date.year
diff_months = expiry_date.month - inspect_date.month
diff_days = expiry_date.day - inspect_date.day
return diff_years * 12 + diff_months + diff_days / 30.0
import re
# TODO: 这里还需要修改
def normalize_month(src_month):
"""
将生产月份字符串标准化为 'yyyy-mm-01' 格式。
支持的输入格式:
- 'yyyy-mm' 或 'yyyy-m'(如 '2025-12'、'2025-1')→ '2025-12-01'、'2025-01-01'
- 'yyyymm'(如 '202512')→ '2025-12-01'
其他格式或无效值原样返回
"""
"""
if not isinstance(src_month, str):
return src_month # 非字符串直接返回
src_month = src_month.strip()
if not src_month:
return src_month
# 情况1: 已是 yyyy-mm 格式(例如 2025-12)
if re.fullmatch(r'\d{4}-\d{1,2}', src_month):
# 可选:统一补零为两位月(如 2025-1 → 2025-01)
year, month = src_month.split('-')
month = month.zfill(2) # 确保月份两位
return f"{year}-{month}-01"
# 情况2: 是 yyyymm 格式(6位数字,如 202512)
if re.fullmatch(r'\d{6}', src_month):
year = src_month[:4]
month = src_month[4:].zfill(2)  # 取后两位并补零,确保月份两位
return f"{year}-{month}-01"
# 其他格式:不处理(或可根据需求返回空)
return src_month
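normalize_month 的行为可以用下面这个独立示例直观验证(这是与上文最终行为一致的精简版,仅作演示):

```python
import re

def normalize_month(src_month):
    # 与上文实现行为一致的精简版:标准化为 'yyyy-mm-01',无法识别则原样返回
    if not isinstance(src_month, str) or not src_month.strip():
        return src_month
    src_month = src_month.strip()
    if re.fullmatch(r'\d{4}-\d{1,2}', src_month):
        year, month = src_month.split('-')
        return f"{year}-{month.zfill(2)}-01"
    if re.fullmatch(r'\d{6}', src_month):
        return f"{src_month[:4]}-{src_month[4:]}-01"
    return src_month

print(normalize_month("2025-12"))   # 2025-12-01
print(normalize_month("2025-1"))    # 2025-01-01
print(normalize_month("202512"))    # 2025-12-01
print(normalize_month("2025/12"))   # 2025/12(未识别,原样返回)
```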
def transform(df_source, yname, pg, audit_date: str = None):
"""
供 API 调用的数据转换入口:接收 DataFrame,返回清洗后的记录列表,不读写任何文件。
Args:
df_source: pandas DataFrame,列通过 iloc 按位置访问(header=2 读入后索引从 0 开始)
yname: 稽查来源名称,如 '浦零' 或 '诚予'
pg: 产品组配置列表(PRODUCT_GROUPS 或 PRODUCT_GROUPS_CY)
audit_date: 稽查日期字符串,格式 'yyyy-mm-dd';为 None 时自动取上月1号
Returns:
list[dict]: 按 STANDARD_COLUMNS 结构整理好的记录列表(产品生产月份为字符串)
"""
from datetime import date as date_type
if audit_date is None:
audit_date = (datetime.now().replace(day=1) - relativedelta(months=1)).strftime("%Y-%m-01")
records = []
for idx, row in df_source.iterrows():
base_data = {
"勤策门店编码": str(row.iloc[1]).strip() if pd.notna(row.iloc[1]) else "",
"城市": str(row.iloc[2]).strip() if pd.notna(row.iloc[2]) else "",
"勤策门店名称": str(row.iloc[3]).strip() if pd.notna(row.iloc[3]) else "",
"经销商名称": str(row.iloc[4]).strip() if pd.notna(row.iloc[4]) else "",
"渠道类型": str(row.iloc[5]).strip() if pd.notna(row.iloc[5]) else "",
}
base_row = {}
if COLUMN_MAPPING["稽查日期"] in STANDARD_COLUMNS:
base_row[COLUMN_MAPPING["稽查日期"]] = audit_date
if COLUMN_MAPPING["稽查来源"] in STANDARD_COLUMNS:
base_row[COLUMN_MAPPING["稽查来源"]] = yname
if COLUMN_MAPPING["勤策门店编码"] in STANDARD_COLUMNS:
base_row[COLUMN_MAPPING["勤策门店编码"]] = base_data["勤策门店编码"]
if COLUMN_MAPPING["勤策门店名称"] in STANDARD_COLUMNS:
base_row[COLUMN_MAPPING["勤策门店名称"]] = base_data["勤策门店名称"]
if COLUMN_MAPPING["经销商名称"] in STANDARD_COLUMNS:
base_row[COLUMN_MAPPING["经销商名称"]] = base_data["经销商名称"]
if COLUMN_MAPPING["城市"] in STANDARD_COLUMNS:
base_row[COLUMN_MAPPING["城市"]] = base_data["城市"]
if COLUMN_MAPPING["渠道类型"] in STANDARD_COLUMNS:
base_row[COLUMN_MAPPING["渠道类型"]] = base_data["渠道类型"]
for group in pg:
price_col = group["price_col"]
flavor_cols = group["flavor_cols"]
flavors = group["flavors"]
series = group["series"]
weight = group["weight"]
src_price = str(row.iloc[price_col]).strip() if pd.notna(row.iloc[price_col]) else ""
if not src_price or src_price == '无价签':
src_price = ''
row_with_price = copy.deepcopy(base_row)
if COLUMN_MAPPING["产品价格"] in STANDARD_COLUMNS:
row_with_price[COLUMN_MAPPING["产品价格"]] = src_price
for i, col_idx in enumerate(flavor_cols):
flavor_name = flavors[i]
src_month = str(row.iloc[col_idx]).strip() if pd.notna(row.iloc[col_idx]) else ""
if src_month:
new_rec = copy.deepcopy(row_with_price)
src_month = normalize_month(src_month)
_set_product_fields(new_rec, series, flavor_name, weight, src_month, STANDARD_COLUMNS)
rDate(new_rec)
records.append(new_rec)
elif src_price:
new_rec = copy.deepcopy(row_with_price)
_set_product_fields(new_rec, series, flavor_name, weight, None, STANDARD_COLUMNS)
rDate(new_rec)
records.append(new_rec)
# 将 date 对象统一转为 ISO 字符串,保证 JSON 可序列化
for rec in records:
for k, v in rec.items():
if isinstance(v, date_type):
rec[k] = v.isoformat()
return records
if __name__ == "__main__":
# TODO: 配置sheet页名称
print("正在读取【浦零】源文件(跳过前三行)...")
df_source_p = pd.read_excel(source_file, header=2, dtype=str)
main(df_source_p, '浦零', PRODUCT_GROUPS)
#print("正在读取【诚予】源文件(跳过前三行)...")
#df_source_c = pd.read_excel(source_file_cy, sheet_name="Sheet1", header=2, dtype=str)
#main(df_source_c,'诚予',PRODUCT_GROUPS_CY)
"""
数据清洗系统 - FastAPI 应用主程序
Description: 提供 Excel 数据解析、清洗和存储的 API 服务
"""
from fastapi import FastAPI, BackgroundTasks
from pydantic import BaseModel
import logging
import uuid
import asyncio
import math
import pandas as pd
from io import BytesIO
from datetime import datetime
from typing import Optional, Dict, Any
# 导入业务模块
from core.excel_handler import ExcelHandler
from core.data_cleaner import DataCleaner
from core.db_handler import DatabaseHandler
from core.progress_manager import ProgressManager
from utils.exceptions import DataCleaningException, DatabaseException
from utils.validators import validate_excel_url
from utils.response import BizCode, ok_resp, fail_resp
def _sanitize_nan(records: list) -> list:
"""将列表中每行 dict 里的 float NaN / Inf 替换为 None,确保 JSON 可序列化。"""
sanitized = []
for row in records:
sanitized.append({
k: (None if isinstance(v, float) and (math.isnan(v) or math.isinf(v)) else v)
for k, v in row.items()
})
return sanitized
# 配置日志
logging.basicConfig(
level=logging.INFO, # 只记录 INFO 以上的日志
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s' # 时间 - 模块名 - 级别 - 内容
)
logger = logging.getLogger(__name__) # __name__ 运行时获取模块名
# 创建 FastAPI 应用
app = FastAPI(
title="数据清洗系统",
description="用于数据解析、清洗和持久化的 API 服务",
version="1.0.0"
)
# ==================== 请求数据模型 ====================
class CleaningRequest(BaseModel):
"""数据清洗请求模型"""
excel_url: Optional[str] = None # 普通清洗模式必填;风控稽查模式可不传
department: str
description: Optional[str] = None
audit_date: Optional[str] = None # 稽查日期,格式 'yyyy-mm-dd',不传则取上月1号
# ── 风控稽查数据清洗 专用字段 ──────────────────────────────────
year: Optional[int] = None # 数据所属年
month: Optional[int] = None # 数据所属月
day: Optional[int] = None # 数据所属日
team_url: Optional[str] = None # 团队数据表链接
puling_url: Optional[str] = None # 浦零数据表链接
chengyu_url: Optional[str] = None # 诚予数据表链接
class SavingRequest(BaseModel):
"""数据保存请求模型"""
task_id: str
table_name: str
# ==================== 业务逻辑 ====================
class DataCleaningService:
"""数据清洗服务主类"""
# 性能基准参数(可根据实际情况调整)
DOWNLOAD_TIME_BASE = 2 # 下载和解析基础时间(秒)
DOWNLOAD_TIME_PER_ROW = 0.0001 # 每行数据的下载时间(秒)
CLEANING_TIME_PER_ROW = 0.001 # 每行数据的清洗时间(秒)
VALIDATION_TIME_BASE = 1 # 验证基础时间(秒)
CACHING_TIME_PER_ROW = 0.0001 # 每行数据的缓存时间(秒)
CACHE_TTL_SECONDS = 1800 # cache 保留时长:30 分钟
def __init__(self):
self.progress_manager = ProgressManager()
self.excel_handler = ExcelHandler()
self.data_cleaner = DataCleaner()
self.db_handler = DatabaseHandler()
# 存储已清洗的数据(内存中,可扩展为 Redis)
self.cleaned_data_cache: Dict[str, Any] = {}
def _evict_expired_cache(self):
"""清除超过 TTL 的 cache 条目,在写入和读取时调用"""
now = datetime.now()
expired = [
tid for tid, v in self.cleaned_data_cache.items()
if (now - v['created_at']).total_seconds() > self.CACHE_TTL_SECONDS
]
for tid in expired:
del self.cleaned_data_cache[tid]
logger.info(f"[cache] 已清除过期任务 {tid}")
def estimate_completion_time(self, row_count: int) -> int:
"""
根据数据行数预估完成时间
Args:
row_count: Excel 文件的数据行数
Returns:
int: 预估完成时间(秒)
"""
# 计算各阶段时间
download_time = self.DOWNLOAD_TIME_BASE + (row_count * self.DOWNLOAD_TIME_PER_ROW)
validation_time = self.VALIDATION_TIME_BASE
cleaning_time = row_count * self.CLEANING_TIME_PER_ROW
caching_time = row_count * self.CACHING_TIME_PER_ROW
# 总时间(向上取整)
total_time = int(download_time + validation_time + cleaning_time + caching_time)
# 最少 5 秒,最多 3600 秒(1小时)
return max(5, min(total_time, 3600))
async def clean_data_from_url(
self,
task_id: str,
excel_url: str,
department: str,
raw_data: list = None,
audit_date: str = None
) -> Dict[str, Any]:
"""
从 URL 下载并清洗 Excel 数据
Args:
task_id: 任务唯一标识
excel_url: Excel 文件的网络链接
department: 业务部门名称
raw_data: 可选,已下载的原始数据(由路由层传入以避免重复下载)
audit_date: 稽查日期字符串,格式 'yyyy-mm-dd'
Returns:
包含清洗结果的字典
"""
try:
# 1. 记录任务开始
self.progress_manager.update_progress(
task_id,
status="processing",
progress=10,
message="开始下载 Excel 文件"
)
logger.info(f"[{task_id}] 开始处理数据清洗任务")
# 2. 下载并解析 Excel(若路由层已下载则直接复用,避免重复请求)
if raw_data is None:
self.progress_manager.update_progress(
task_id,
status="processing",
progress=20,
message="正在解析 Excel 文件"
)
raw_data = await self.excel_handler.fetch_and_parse(excel_url)
logger.info(f"[{task_id}] 成功解析 Excel,数据行数: {len(raw_data)}")
# 3. 数据验证
self.progress_manager.update_progress(
task_id,
status="processing",
progress=30,
message="正在验证数据"
)
if not raw_data:
raise DataCleaningException("解析的 Excel 数据为空")
# 4. 执行数据清洗
self.progress_manager.update_progress(
task_id,
status="processing",
progress=50,
message="正在清洗数据"
)
cleaned_data = await self.data_cleaner.clean(
raw_data,
department,
progress_callback=lambda p, m, count=None: self.progress_manager.update_progress(
task_id,
status="processing",
progress=int(50 + p * 0.4), # 进度从50%到90%
message=m,
processed_count=count
),
audit_date=audit_date
)
logger.info(f"[{task_id}] 数据清洗完成,清洗后数据行数: {len(cleaned_data)}")
# 5. 缓存清洗后的数据(写入前先清除过期条目)
self.progress_manager.update_progress(
task_id,
status="processing",
progress=90,
message="正在缓存清洗后的数据"
)
self._evict_expired_cache()
safe_data = _sanitize_nan(cleaned_data)
self.cleaned_data_cache[task_id] = {
'data': safe_data,
'department': department,
'created_at': datetime.now(),
'row_count': len(safe_data)
}
# 6. 任务完成
self.progress_manager.update_progress(
task_id,
status="completed",
progress=100,
message="数据清洗完成,等待前端确认",
processed_count=len(cleaned_data)
)
return {
'task_id': task_id,
'status': 'completed',
'message': '数据清洗成功',
'data_preview': cleaned_data[:5], # 返回前5行用于预览
'total_rows': len(cleaned_data)
}
except DataCleaningException as e:
logger.error(f"[{task_id}] 数据清洗业务异常: {str(e)}")
self.progress_manager.update_progress(
task_id,
status="failed",
progress=0,
message=f"清洗失败: {str(e)}"
)
raise
except Exception as e:
logger.error(f"[{task_id}] 数据清洗系统异常: {str(e)}", exc_info=True)
self.progress_manager.update_progress(
task_id,
status="failed",
progress=0,
message=f"系统异常: {str(e)}"
)
raise DataCleaningException(f"未知错误: {str(e)}")
async def save_cleaned_data(
self,
task_id: str,
table_name: str
) -> Dict[str, Any]:
"""
将清洗后的数据保存到数据库
Args:
task_id: 任务唯一标识
table_name: 目标表名
Returns:
包含保存结果的字典
"""
try:
logger.info(f"[{task_id}] 开始保存数据到数据库")
# 验证数据是否存在(先清除过期条目)
self._evict_expired_cache()
if task_id not in self.cleaned_data_cache:
raise DatabaseException(f"任务 {task_id} 的清洗数据不存在或已过期(超过30分钟)")
cleaned_data = self.cleaned_data_cache[task_id]['data']
# 保存到数据库
affected_rows = await self.db_handler.insert_data(
table_name,
cleaned_data
)
logger.info(f"[{task_id}] 成功保存 {affected_rows} 行数据到 {table_name}")
# 清理缓存
del self.cleaned_data_cache[task_id]
return {
'task_id': task_id,
'status': 'saved',
'message': '数据已成功保存到数据库',
'affected_rows': affected_rows
}
except DatabaseException as e:
logger.error(f"[{task_id}] 数据库异常: {str(e)}")
raise
except Exception as e:
logger.error(f"[{task_id}] 保存数据时出错: {str(e)}", exc_info=True)
raise DatabaseException(f"保存失败: {str(e)}")
async def clean_fengkong_data(
self,
task_id: str,
team_url: Optional[str],
puling_url: Optional[str],
chengyu_url: Optional[str],
audit_date: Optional[str],
) -> Dict[str, Any]:
"""
风控稽查数据清洗:分别下载团队、浦零、诚予数据源,各自清洗后合并为一张大宽表,
结果存入内存缓存,不写本地文件。
Args:
task_id: 任务唯一标识
team_url: 团队数据表下载链接(可为 None)
puling_url: 浦零数据表下载链接(可为 None)
chengyu_url: 诚予数据表下载链接(可为 None)
audit_date: 稽查日期,格式 'yyyy-mm-dd';为 None 时各模块自动取上月1号
"""
from core_py.数据转换_团队 import (
transform as team_transform,
PRODUCT_GROUPS_JC,
STANDARD_COLUMNS,
)
from core_py.数据转换_诚予_浦零 import (
transform as pl_cy_transform,
PRODUCT_GROUPS,
PRODUCT_GROUPS_CY,
)
try:
self.progress_manager.update_progress(
task_id, status="processing", progress=5, message="开始风控稽查数据清洗"
)
logger.info(f"[{task_id}] 开始风控稽查数据清洗,audit_date={audit_date}")
all_records = []
progress_step = 0
source_count = sum(1 for u in [team_url, puling_url, chengyu_url] if u)
progress_per_source = int(80 / source_count) if source_count else 80
# ── 1. 团队数据 ──────────────────────────────────────────
if team_url:
progress_step += progress_per_source
self.progress_manager.update_progress(
task_id, status="processing",
progress=max(10, progress_step - progress_per_source + 10),
message="正在下载团队数据表..."
)
raw_bytes = await self.excel_handler.fetch_bytes(team_url)
df_team = await asyncio.to_thread(
pd.read_excel, BytesIO(raw_bytes), skiprows=1, header=None, dtype=str
)
self.progress_manager.update_progress(
task_id, status="processing",
progress=max(10, progress_step - progress_per_source // 2),
message="正在清洗团队数据..."
)
records_team = await asyncio.to_thread(
team_transform, df_team, "稽查团队", PRODUCT_GROUPS_JC, audit_date
)
all_records.extend(records_team)
logger.info(f"[{task_id}] 团队数据清洗完成,{len(records_team)} 条记录")
# ── 2. 浦零数据 ──────────────────────────────────────────
if puling_url:
progress_step += progress_per_source
self.progress_manager.update_progress(
task_id, status="processing",
progress=max(15, progress_step - progress_per_source + 10),
message="正在下载浦零数据表..."
)
raw_bytes = await self.excel_handler.fetch_bytes(puling_url)
df_pl = await asyncio.to_thread(
pd.read_excel, BytesIO(raw_bytes), header=2, dtype=str
)
self.progress_manager.update_progress(
task_id, status="processing",
progress=max(15, progress_step - progress_per_source // 2),
message="正在清洗浦零数据..."
)
records_pl = await asyncio.to_thread(
pl_cy_transform, df_pl, "浦零", PRODUCT_GROUPS, audit_date
)
all_records.extend(records_pl)
logger.info(f"[{task_id}] 浦零数据清洗完成,{len(records_pl)} 条记录")
# ── 3. 诚予数据 ──────────────────────────────────────────
if chengyu_url:
progress_step += progress_per_source
self.progress_manager.update_progress(
task_id, status="processing",
progress=max(20, progress_step - progress_per_source + 10),
message="正在下载诚予数据表..."
)
raw_bytes = await self.excel_handler.fetch_bytes(chengyu_url)
df_cy = await asyncio.to_thread(
pd.read_excel, BytesIO(raw_bytes), header=2, dtype=str
)
self.progress_manager.update_progress(
task_id, status="processing",
progress=max(20, progress_step - progress_per_source // 2),
message="正在清洗诚予数据..."
)
records_cy = await asyncio.to_thread(
pl_cy_transform, df_cy, "诚予", PRODUCT_GROUPS_CY, audit_date
)
all_records.extend(records_cy)
logger.info(f"[{task_id}] 诚予数据清洗完成,{len(records_cy)} 条记录")
# ── 4. 合并为大宽表(内存,不写文件) ──────────────────
self.progress_manager.update_progress(
task_id, status="processing", progress=90, message="正在合并数据宽表..."
)
df_merged = pd.DataFrame(all_records, columns=STANDARD_COLUMNS)
merged_records = _sanitize_nan(
df_merged.where(pd.notna(df_merged), None).to_dict(orient="records")
)
logger.info(f"[{task_id}] 大宽表合并完成,共 {len(merged_records)} 条记录")
# ── 5. 写入内存缓存 ──────────────────────────────────────
self._evict_expired_cache()
self.cleaned_data_cache[task_id] = {
"data": merged_records,
"department": "风控稽查数据清洗",
"created_at": datetime.now(),
"row_count": len(merged_records),
}
self.progress_manager.update_progress(
task_id, status="completed", progress=100,
message=f"风控稽查数据清洗完成,共 {len(merged_records)} 条记录,等待前端确认",
processed_count=len(merged_records)
)
return {
"task_id": task_id,
"status": "completed",
"message": "风控稽查数据清洗成功",
"data_preview": merged_records[:5],
"total_rows": len(merged_records),
}
except Exception as e:
logger.error(f"[{task_id}] 风控稽查数据清洗失败: {str(e)}", exc_info=True)
self.progress_manager.update_progress(
task_id, status="failed", progress=0,
message=f"清洗失败: {str(e)}"
)
raise
# ==================== 初始化服务 ====================
service = DataCleaningService()
# ==================== API 路由 ====================
@app.post("/api/v1/clean")
async def start_cleaning(request: CleaningRequest, background_tasks: BackgroundTasks):
"""
启动数据清洗任务
Returns: { code, msg, data: { task_id, status, estimated_completion_time, total_rows } }
"""
try:
task_id = str(uuid.uuid4())
logger.info(f"创建新任务: {task_id}, 部门: {request.department}")
# ── 风控稽查数据清洗 专用分支 ──────────────────────────────
if request.department == "风控稽查数据清洗":
if not any([request.team_url, request.puling_url, request.chengyu_url]):
return fail_resp(BizCode.BAD_REQUEST, "风控稽查数据清洗至少需要提供一个数据源地址(team_url / puling_url / chengyu_url)")
# 从 year/month/day 构造稽查日期,未传则由清洗模块自动取上月1号
audit_date = None
if request.year and request.month and request.day:
audit_date = f"{request.year}-{request.month:02d}-{request.day:02d}"
estimated_rows = 1000
estimated_time = service.estimate_completion_time(estimated_rows)
# 提前写入 queued 进度,避免前端轮询时返回 404
service.progress_manager.update_progress(
task_id, status="queued", progress=0, message="任务已创建,等待处理"
)
background_tasks.add_task(
service.clean_fengkong_data,
task_id,
request.team_url,
request.puling_url,
request.chengyu_url,
audit_date,
)
# ── 普通清洗分支 ───────────────────────────────────────────
else:
if not validate_excel_url(request.excel_url):
return fail_resp(BizCode.BAD_REQUEST, "Excel URL 格式无效")
estimated_rows = 0
estimated_time = 5
prefetched_raw_data = None
try:
prefetched_raw_data = await service.excel_handler.fetch_and_parse(request.excel_url)
estimated_rows = len(prefetched_raw_data)
estimated_time = service.estimate_completion_time(estimated_rows)
logger.info(f"[{task_id}] 预估数据行数: {estimated_rows}, 预估完成时间: {estimated_time}秒")
except Exception as e:
logger.warning(f"[{task_id}] 预读 Excel 失败,后台任务将重新下载: {str(e)}")
estimated_rows = 1000
estimated_time = service.estimate_completion_time(estimated_rows)
background_tasks.add_task(
service.clean_data_from_url,
task_id,
request.excel_url,
request.department,
prefetched_raw_data,
request.audit_date,
)
return ok_resp(
data={
"task_id": task_id,
"status": "queued",
"estimated_completion_time": estimated_time,
"total_rows": estimated_rows,
},
msg="任务已创建,正在处理中..."
)
except Exception as e:
logger.error(f"启动清洗任务失败: {str(e)}")
return fail_resp(BizCode.SERVER_ERROR, f"启动任务失败: {str(e)}", http_status=500)
@app.get("/api/v1/progress/{task_id}")
async def get_progress(task_id: str):
"""
获取数据清洗进度(HTTP 轮询,建议前端每 500ms-1s 调用一次)
Returns: { code, msg, data: { task_id, status, progress, message, timestamp } }
"""
try:
progress_data = service.progress_manager.get_progress(task_id)
if not progress_data:
return fail_resp(BizCode.NOT_FOUND, "任务不存在", http_status=404)
return ok_resp(data=progress_data)
except Exception as e:
logger.error(f"获取进度失败: {str(e)}")
return fail_resp(BizCode.SERVER_ERROR, "获取进度失败", http_status=500)
@app.get("/api/v1/result/{task_id}")
async def get_cleaning_result(task_id: str):
"""
获取清洗结果及数据预览(任务完成后调用)
Returns: { code, msg, data: { task_id, status, data_preview, total_rows, department } }
"""
try:
progress_data = service.progress_manager.get_progress(task_id)
if not progress_data:
return fail_resp(BizCode.NOT_FOUND, "任务不存在", http_status=404)
if progress_data['status'] == 'processing':
return fail_resp(BizCode.TASK_PROCESSING, "任务仍在处理中", http_status=202)
if progress_data['status'] == 'failed':
return fail_resp(BizCode.TASK_FAILED, progress_data['message'])
service._evict_expired_cache()
if task_id not in service.cleaned_data_cache:
return fail_resp(BizCode.NOT_FOUND, "清洗数据不存在或已过期(超过30分钟)", http_status=404)
cached = service.cleaned_data_cache[task_id]
return ok_resp(
data={
"task_id": task_id,
"status": "ready_to_save",
"data_preview": cached['data'][:10],
"total_rows": cached['row_count'],
"department": cached['department']
},
msg="数据清洗完成,可进行保存"
)
except Exception as e:
logger.error(f"获取清洗结果失败: {str(e)}")
return fail_resp(BizCode.SERVER_ERROR, "获取结果失败", http_status=500)
@app.post("/api/v1/save")
async def save_cleaned_data(request: SavingRequest):
"""
保存清洗后的数据到 MySQL 数据库(前端确认数据无误后调用)
Returns: { code, msg, data: { task_id, status, affected_rows } }
"""
try:
if not request.task_id or not request.table_name:
return fail_resp(BizCode.BAD_REQUEST, "参数不完整:task_id 和 table_name 均为必填")
result = await service.save_cleaned_data(request.task_id, request.table_name)
return ok_resp(data=result, msg="数据已成功保存到数据库")
except DatabaseException as e:
logger.error(f"保存数据失败: {str(e)}")
return fail_resp(BizCode.DB_ERROR, str(e), http_status=500)
except Exception as e:
logger.error(f"保存数据时发生错误: {str(e)}")
return fail_resp(BizCode.SERVER_ERROR, f"保存失败: {str(e)}", http_status=500)
@app.get("/api/v1/health")
async def health_check():
"""健康检查接口"""
return ok_resp(
data={"service": "数据清洗系统", "timestamp": str(datetime.now())},
msg="healthy"
)
@app.get("/")
async def root():
"""根路由 - API 欢迎信息"""
return ok_resp(
data={"version": "1.0.0", "docs": "/docs", "redoc": "/redoc"},
msg="欢迎使用数据清洗系统"
)
# ==================== Exception handlers ====================
@app.exception_handler(DataCleaningException)
async def data_cleaning_exception_handler(request, exc):
    """Handle data-cleaning exceptions."""
    logger.error(f"DataCleaningException: {str(exc)}")
    return fail_resp(BizCode.TASK_FAILED, str(exc), http_status=400)

@app.exception_handler(DatabaseException)
async def database_exception_handler(request, exc):
    """Handle database exceptions."""
    logger.error(f"DatabaseException: {str(exc)}")
    return fail_resp(BizCode.DB_ERROR, str(exc), http_status=500)
# ==================== Startup / shutdown events ====================
@app.on_event("startup")
async def startup_event():
    """Initialisation on application startup."""
    logger.info("Data cleaning system starting")
    try:
        # Initialise database connections etc.
        pass
    except Exception as e:
        logger.error(f"Error during startup: {str(e)}")

@app.on_event("shutdown")
async def shutdown_event():
    """Cleanup on application shutdown."""
    logger.info("Data cleaning system shutting down")
    try:
        # Close database connections etc.
        pass
    except Exception as e:
        logger.error(f"Error during shutdown: {str(e)}")
# ==================== Entry point ====================
if __name__ == "__main__":
    import uvicorn
    # Run the Uvicorn server. Note: `reload` only takes effect when the app is
    # passed as an import string, not as an object.
    uvicorn.run(
        "index:app",
        host="0.0.0.0",
        port=8000,
        log_level="info",
        reload=True  # hot reload for development
    )
fastapi==0.104.1
uvicorn==0.24.0
python-multipart==0.0.6
openpyxl==3.1.5
requests==2.31.0
aiohttp==3.9.1
httpx==0.25.2  # used by the API test script
mysql-connector-python==8.2.0
pydantic==2.4.2
python-dotenv==1.0.0
pandas>=2.0.0
python-dateutil>=2.8.2
/*
Navicat MySQL Data Transfer
Source Server : t100_dev
Source Server Version : 50744
Source Host : 192.168.100.39:25301
Source Database : market_bi
Target Server Type : MYSQL
Target Server Version : 50744
File Encoding : 65001
Date: 2026-03-09 18:13:42
*/
SET FOREIGN_KEY_CHECKS=0;
-- ----------------------------
-- Table structure for risk_audit_visit
-- ----------------------------
DROP TABLE IF EXISTS `risk_audit_visit`;
CREATE TABLE `risk_audit_visit` (
  `rav_id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'primary key',
  `audit_date` date DEFAULT NULL COMMENT 'audit date',
  `source` varchar(20) DEFAULT NULL COMMENT 'audit source',
  `region_name` varchar(20) DEFAULT NULL COMMENT 'region',
  `district_name` varchar(20) DEFAULT NULL COMMENT 'sales district',
  `dealer_code` varchar(10) DEFAULT NULL COMMENT 'dealer code',
  `dealer_name` varchar(100) DEFAULT NULL COMMENT 'dealer name',
  `store_code` varchar(20) DEFAULT NULL COMMENT 'store code',
  `store_name` varchar(100) DEFAULT NULL COMMENT 'store name in Qince (勤策)',
  `f_emp_no` varchar(20) DEFAULT NULL COMMENT 'account manager employee no.',
  `f_emp_name` varchar(100) DEFAULT NULL COMMENT 'account manager name',
  `qin_ce_type_large` varchar(20) DEFAULT NULL COMMENT 'Qince channel major category',
  `jh_channel_type` varchar(20) DEFAULT NULL COMMENT 'audit channel type',
  `city` varchar(30) DEFAULT NULL COMMENT 'city',
  `channel_type` varchar(30) DEFAULT NULL COMMENT 'channel type (as supplied by the audit source)',
  `series` varchar(20) DEFAULT NULL COMMENT 'product series',
  `taste` varchar(20) DEFAULT NULL COMMENT 'product flavor',
  `weight` varchar(20) DEFAULT NULL COMMENT 'product weight (grams)',
  `price` decimal(10,2) DEFAULT NULL COMMENT 'product price',
  `low_price` varchar(20) DEFAULT NULL COMMENT 'low-price flag: 低价 (low) / 正常 (normal)',
  `low_price_diff` decimal(10,2) DEFAULT NULL COMMENT 'price gap',
  `low_price_status` varchar(20) DEFAULT NULL COMMENT 'low-price rectification status',
  `low_price_rectify` varchar(100) DEFAULT NULL COMMENT 'low-price rectification note',
  `production_month` date DEFAULT NULL COMMENT 'production month',
  `near_month_num` int(11) DEFAULT NULL COMMENT 'months until near-expiry',
  `near_month_status` varchar(20) DEFAULT NULL COMMENT 'near-expiry status',
  `fresh_status` varchar(20) DEFAULT NULL COMMENT 'freshness',
  `large_date_status` varchar(20) DEFAULT NULL COMMENT 'aged-date rectification status',
  `large_date_rectify` varchar(100) DEFAULT NULL COMMENT 'aged-date rectification note',
  PRIMARY KEY (`rav_id`),
  KEY `audit` (`audit_date`),
  KEY `dealer` (`dealer_code`,`dealer_name`),
  KEY `product_index` (`series`,`taste`,`weight`),
  KEY `regiondistrict` (`region_name`,`district_name`),
  KEY `type_small` (`jh_channel_type`),
  KEY `weight_index` (`weight`)
) ENGINE=InnoDB AUTO_INCREMENT=493621 DEFAULT CHARSET=utf8mb4 COMMENT='audit visit price and aged-date table';
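For reference, rows matching this schema can be batch-inserted with mysql-connector-python's `cursor.executemany()`, which expects a `%s`-parameterized statement. A minimal sketch of building such a statement (the column subset here is chosen arbitrarily for illustration, not taken from the actual save logic):

```python
# Hypothetical subset of risk_audit_visit columns, for illustration only.
COLUMNS = ["audit_date", "source", "dealer_code", "dealer_name", "price"]

def build_insert_sql(table: str, columns: list[str]) -> str:
    """Build a %s-parameterized INSERT statement suitable for
    cursor.executemany() with mysql-connector-python."""
    col_list = ", ".join(f"`{c}`" for c in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO `{table}` ({col_list}) VALUES ({placeholders})"

sql = build_insert_sql("risk_audit_visit", COLUMNS)
print(sql)
```

Keeping values out of the SQL string and passing them as a sequence of tuples to `executemany()` avoids SQL injection and lets the driver batch the rows.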
"""
API test script
Quickly exercises each API endpoint.
"""
import asyncio
import httpx
import json
from datetime import datetime

BASE_URL = "http://localhost:8000"
class APITester:
    """API test harness."""
    def __init__(self, base_url: str = BASE_URL):
        self.base_url = base_url
        self.task_id = None  # set once a cleaning task has been created

    async def test_health_check(self):
        """Test the health-check endpoint."""
        print("\n" + "=" * 50)
        print("Test: health check")
        print("=" * 50)
        try:
            async with httpx.AsyncClient() as client:
                response = await client.get(f"{self.base_url}/api/v1/health")
                print(f"Status code: {response.status_code}")
                print(f"Response: {json.dumps(response.json(), indent=2, ensure_ascii=False)}")
        except Exception as e:
            print(f"Error: {str(e)}")
    async def test_start_cleaning(self):
        """Test the endpoint that starts a cleaning task."""
        print("\n" + "=" * 50)
        print("Test: start a data-cleaning task")
        print("=" * 50)
        payload = {
            "excel_url": "https://example.com/test_data.xlsx",
            "department": "sales",
            "description": "test data cleaning"
        }
        try:
            async with httpx.AsyncClient() as client:
                response = await client.post(
                    f"{self.base_url}/api/v1/clean",
                    json=payload
                )
                print(f"Status code: {response.status_code}")
                data = response.json()
                print(f"Response: {json.dumps(data, indent=2, ensure_ascii=False)}")
                if response.status_code == 200:
                    # Every endpoint wraps its payload as { code, msg, data },
                    # so task_id lives under the nested "data" key.
                    self.task_id = (data.get('data') or {}).get('task_id')
                    print(f"\n✓ Task created, Task ID: {self.task_id}")
        except Exception as e:
            print(f"Error: {str(e)}")
    async def test_get_progress(self):
        """Test the progress endpoint."""
        if not self.task_id:
            print("Skipped: a task must be created first")
            return
        print("\n" + "=" * 50)
        print("Test: fetch cleaning progress")
        print("=" * 50)
        try:
            async with httpx.AsyncClient() as client:
                response = await client.get(
                    f"{self.base_url}/api/v1/progress/{self.task_id}"
                )
                print(f"Status code: {response.status_code}")
                print(f"Response: {json.dumps(response.json(), indent=2, ensure_ascii=False, default=str)}")
        except Exception as e:
            print(f"Error: {str(e)}")
    async def test_get_result(self):
        """Test the cleaning-result endpoint."""
        if not self.task_id:
            print("Skipped: a task must be created first")
            return
        print("\n" + "=" * 50)
        print("Test: fetch the cleaning result")
        print("=" * 50)
        try:
            async with httpx.AsyncClient() as client:
                response = await client.get(
                    f"{self.base_url}/api/v1/result/{self.task_id}"
                )
                print(f"Status code: {response.status_code}")
                data = response.json()
                print(f"Response: {json.dumps(data, indent=2, ensure_ascii=False, default=str)}")
        except Exception as e:
            print(f"Error: {str(e)}")
    async def test_save_data(self):
        """Test the save endpoint."""
        if not self.task_id:
            print("Skipped: a task must be created first")
            return
        print("\n" + "=" * 50)
        print("Test: save the cleaned data")
        print("=" * 50)
        payload = {
            "task_id": self.task_id,
            "table_name": "sales_data"
        }
        try:
            async with httpx.AsyncClient() as client:
                response = await client.post(
                    f"{self.base_url}/api/v1/save",
                    json=payload
                )
                print(f"Status code: {response.status_code}")
                print(f"Response: {json.dumps(response.json(), indent=2, ensure_ascii=False)}")
        except Exception as e:
            print(f"Error: {str(e)}")
    async def run_all_tests(self):
        """Run the read-only test sequence. test_save_data is not called
        automatically; invoke it manually after reviewing the result."""
        print("\n" + "=" * 50)
        print("Data cleaning system - API tests")
        print(f"Time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
        print("=" * 50)
        await self.test_health_check()
        await asyncio.sleep(1)
        await self.test_start_cleaning()
        await asyncio.sleep(2)
        await self.test_get_progress()
        await asyncio.sleep(1)
        await self.test_get_result()
        await asyncio.sleep(1)
        print("\n" + "=" * 50)
        print("All tests finished!")
        print("=" * 50 + "\n")
async def main():
    """Entry point for the test run."""
    tester = APITester()
    await tester.run_all_tests()

if __name__ == "__main__":
    print("\nNote: make sure the FastAPI service is running at http://localhost:8000\n")
    asyncio.run(main())
"""Utils package."""
from utils.response import BizCode, ApiResponse, ok_resp, fail_resp
__all__ = ["BizCode", "ApiResponse", "ok_resp", "fail_resp"]
"""
Custom exception definitions.
"""
class DataCleaningException(Exception):
    """Raised when data cleaning fails."""
    pass

class DatabaseException(Exception):
    """Raised on database errors."""
    pass

class ExcelParsingException(Exception):
    """Raised when Excel parsing fails."""
    pass

class ValidationException(Exception):
    """Raised on validation errors."""
    pass
"""
Unified response envelope.
Every endpoint returns: { code: business status code, msg: message, data: payload }
"""
from enum import IntEnum
from typing import Any
from fastapi.responses import JSONResponse
from pydantic import BaseModel

class BizCode(IntEnum):
    """Business status codes."""
    SUCCESS = 200          # generic success
    TASK_QUEUED = 201      # task queued (async flow)
    TASK_PROCESSING = 202  # task still processing
    BAD_REQUEST = 400      # invalid request parameters
    NOT_FOUND = 404        # resource not found
    TASK_FAILED = 422      # task failed (business layer)
    SERVER_ERROR = 500     # internal server error
    DB_ERROR = 501         # database error
    EXCEL_ERROR = 502      # Excel parsing error

class ApiResponse(BaseModel):
    """Unified API response body."""
    code: int
    msg: str
    data: Any = None
def ok_resp(data: Any = None, msg: str = "success") -> JSONResponse:
    """Build a success JSONResponse (HTTP 200)."""
    return JSONResponse(
        status_code=200,
        content=ApiResponse(code=BizCode.SUCCESS, msg=msg, data=data).model_dump()
    )

def fail_resp(
    biz_code: BizCode,
    msg: str,
    http_status: int = 400,
    data: Any = None
) -> JSONResponse:
    """Build a failure JSONResponse."""
    return JSONResponse(
        status_code=http_status,
        content=ApiResponse(code=biz_code, msg=msg, data=data).model_dump()
    )
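On the consuming side, a caller can dispatch on `code` before touching `data`. A small illustrative helper (the sample payloads below are made up, not real API output):

```python
def unwrap(resp: dict):
    """Return the payload of a { code, msg, data } envelope,
    or raise with the server-provided message on failure."""
    if resp.get("code") == 200:  # BizCode.SUCCESS
        return resp.get("data")
    raise RuntimeError(f"API error {resp.get('code')}: {resp.get('msg')}")

payload = unwrap({"code": 200, "msg": "success", "data": {"task_id": "abc"}})
print(payload["task_id"])  # abc
```

Because failures can arrive with HTTP 200-range or error-range statuses depending on the endpoint, checking the business `code` field is the reliable signal.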
"""
Data validation helpers.
"""
import re
import logging
logger = logging.getLogger(__name__)

def validate_excel_url(url: str) -> bool:
    """
    Validate an Excel download URL.
    Args:
        url: URL string
    Returns:
        bool: True if the URL looks like a valid Excel/CSV link
    """
    if not url or not isinstance(url, str):
        return False
    # Check the URL format (http/https, ending in a spreadsheet extension)
    url_pattern = r'^https?://.*\.(xlsx|xls|csv)$'
    if not re.match(url_pattern, url, re.IGNORECASE):
        logger.warning(f"Invalid URL format: {url}")
        return False
    return True
def sanitize_filename(filename: str) -> str:
    """
    Sanitize a file name by stripping unsafe characters.
    Args:
        filename: original file name
    Returns:
        str: sanitized file name
    """
    # Remove characters that are unsafe on common filesystems
    sanitized = re.sub(r'[<>:"/\\|?*]', '', filename)
    return sanitized[:255]  # cap the length
def validate_table_name(table_name: str) -> bool:
    """
    Validate a database table name.
    Args:
        table_name: table name
    Returns:
        bool: True if the name is a valid table name
    """
    if not table_name or not isinstance(table_name, str):
        return False
    # Rule enforced here: starts with a letter or underscore, then letters,
    # digits or underscores, at most 64 characters total (MySQL's limit)
    table_name_pattern = r'^[a-zA-Z_][a-zA-Z0-9_]{0,63}$'
    if not re.match(table_name_pattern, table_name):
        logger.warning(f"Invalid table name: {table_name}")
        return False
    return True
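The expected behaviour of these helpers can be checked quickly with the patterns copied from above (a standalone sketch, not part of the module):

```python
import re

# Patterns as defined in utils/validators.py
url_pattern = r'^https?://.*\.(xlsx|xls|csv)$'
table_pattern = r'^[a-zA-Z_][a-zA-Z0-9_]{0,63}$'

# URL validation: scheme and extension are both checked, case-insensitively
assert re.match(url_pattern, "https://example.com/data.XLSX", re.IGNORECASE)
assert not re.match(url_pattern, "ftp://example.com/data.xlsx")

# Table names: may not start with a digit, no hyphens
assert re.match(table_pattern, "sales_data")
assert not re.match(table_pattern, "1_bad_name")
assert not re.match(table_pattern, "bad-name")

# Filename sanitising strips filesystem-unsafe characters
assert re.sub(r'[<>:"/\\|?*]', '', 'a<b>:c?.txt') == 'abc.txt'
print("validator patterns behave as expected")
```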